Comprehensive DDoS Attack Classification Using Machine Learning Algorithms

Computers, Materials & Continua, 2022, Issue 10 (posted 2022-11-10 02:29)

Olga Ussatova, Aidana Zhumabekova, Yenlik Begimbayeva, Eric T. Matson and Nikita Ussatov

1 Al-Farabi Kazakh National University, Almaty, 050040, Kazakhstan

2 Institute of Information and Computational Technologies, Almaty, 050010, Kazakhstan

3 Satbayev University, Almaty, 050013, Kazakhstan

4 Purdue University, West Lafayette, 47907, IN, USA

5 Turan University, Almaty, 050013, Kazakhstan

Abstract: The rapid development of Internet technologies has driven the growth of information security techniques that protect data, networks, systems, and applications from various threats. Among the many types of threats, the distributed denial of service (DDoS) attack is one of the most serious and widespread attacks on Internet resources. This attack is intended to paralyze the victim’s system and cause the service to fail. This work is devoted to the classification of DDoS attacks in the network environment called Software-Defined Networking (SDN) using machine learning (ML) algorithms. The analyzed dataset included instances of two classes: benign and malicious. As the dataset contained twenty-two features, feature selection techniques were required for dimensionality reduction. In these experiments, the Information gain, the Chi-square, and the F-test were applied to reduce the number of features to ten. The classes were also not completely balanced, so undersampling, oversampling, and synthetic minority oversampling (SMOTE) techniques were used to equalize them. Previous research works studied the classification of DDoS attacks by applying various feature selection techniques and one or more machine learning algorithms. Still, they paid little attention to evaluating combinations of feature selection and balancing methods with different machine learning algorithms. This work classifies the dataset with eight machine learning algorithms: naïve Bayes, logistic regression, support vector machine, k-nearest neighbors, decision tree, random forest, XGBoost, and CatBoost. In the experimental results, the Information gain and F-test feature selection methods achieved better performance with all eight ML algorithms than the Chi-square technique. Furthermore, the accuracy values on the oversampled and SMOTE datasets were higher than those on the undersampled and imbalanced datasets. Among the machine learning algorithms, the accuracy of support vector machine, logistic regression, and naïve Bayes fluctuated between 0.59 and 0.75, while decision tree, random forest, XGBoost, and CatBoost achieved values of around 0.99 and 1.00 with all feature selection and class balancing techniques.

Keywords: Internet security; networks; systems; DDoS; software-defined networking; feature selection; class balancing; machine learning; XGBoost; CatBoost

1 Introduction

Information security is a set of technologies and management methods required to guarantee the integrity, confidentiality, and availability of information. It aims to protect information, systems, networks, and applications from accidental or deliberate threats [1]. Today’s online world is full of malware attacks [2], threats [3], cyberattacks [4], and scams [5]. However, implementing effective measures to protect information from such threats is difficult, since the variety of threats and attacks that can damage information and computer networks grows every day and the attack methods become more sophisticated. Fig. 1 shows the impact of information threats on information security criteria.

Figure 1: Influence of information threats on information security criteria

A threat is any possibility of compromising information security, in whatever form [6]. Various threats, attacks, and methods are used to access and damage the most important data. Phishing attacks, one of the most common threats over the years, pose a serious danger to information security, and their number is growing [7]. The main purpose of a phishing attack is to gain access to consumers’ confidential personal data and financial information through various technical and social engineering methods [8]. The number and variety of phishing methods grow as information technologies progress. Another popular type of software for unauthorized access to consumer information is spyware. These unwanted programs may compromise or steal the customer’s private information. Spyware is not a direct attack by hackers but an unauthorized, covert installation on a computer. Such spyware provides remote access to the user’s computer and to information about the user’s activities [9]. One of the most serious threats to Internet resources is the denial of service (DoS) attack. This type of attack does not directly damage information but disrupts services by attempting to disable the normal operation of a computer or network. The most advanced type of DoS attack is the distributed denial of service (DDoS) attack. Unlike a DoS attack, a DDoS attack is performed from multiple devices and addresses at once, as shown in Fig. 2.

Figure 2: DDoS architecture

A DDoS attack is a hacker attack that can paralyze websites within a short period of time [10]. DDoS attacks are not intended to damage the victim’s system but to cause its service to fail [11]. Among industrial threats, the most serious malware is called Flame. This program is designed to steal and attack various valuable data. It is highly capable because it has a variety of features, such as keyboard and network traffic monitoring, voice recording, and screenshot capture. It spreads to other systems over a local area network (LAN) or through USB storage. Information security tools need to be strengthened to protect against attacks such as those described above. DDoS attacks have become common in both traditional and Software-Defined Networking (SDN) environments. In the traditional environment, they are directed at servers, while in the SDN they target the controller. Under a DDoS attack, the SDN controller fails to provide services for forwarding data packets, whereas in traditional networks the server completely stops providing services to users.

Information security tools must be chosen with regard to the compatibility of their functions and the effectiveness of their use, since excessive precaution can lead to high costs. A variety of methods and tools are used to identify threats, and machine learning (ML) is one of the most important and effective [12]. It is used to identify information system objects more accurately and is now popular for many real-time tasks that rely on sophisticated algorithms. The field is closely related to computational statistics, making predictions with complex algorithms based on statistical data. The most effective algorithm is determined by the volume, quality, and nature of the data. ML algorithms developed for detecting Internet threats have demonstrated very effective results in classifying normal and malicious network traffic. Nevertheless, ML algorithms show good results only with datasets that include a reasonable set of features.

Furthermore, neural networks have started to be used to identify different kinds of threats [13-15]. Neural networks [13] are an ML discipline that mimics the way neurons work in the human brain. In [14], neural networks are proposed specifically to determine the typical characteristics of system users and their statistically significant deviations. A neural network consists of input, hidden, and output nodes. Input nodes represent the information that enters the network and are connected to hidden nodes, which receive the information passed to them. Each hidden node has a threshold: it is activated if the aggregate of its inputs reaches a certain value.

Although ML and neural network methods have been used in threat identification systems in recent years, the existing articles do not cover all aspects of DDoS attacks or all kinds of environments in which network devices operate. Moreover, it is important to use four metrics (accuracy, precision, recall, and F1-score) to measure the efficiency of classification algorithms. Unfortunately, many research works report only one or two metrics in their experimental results. When ML algorithms are applied to DDoS threat classification, feature selection techniques play an important role in choosing the best features of the dataset. The most popular methods include the Information Gain (IG), Chi-square test, F-test, Fisher’s Score, Correlation Coefficient, and Variance Threshold. In the experimental part of this work, three feature selection techniques (IG, Chi-square, and F-test) are evaluated. Another important problem that was not addressed in previous research on DDoS threat classification is dataset imbalance. This problem occurs when the classes in the training dataset are of unequal size. It lowers the values of the evaluation metrics and is generally undesirable in classification problems. Random oversampling, random undersampling, and the synthetic minority oversampling technique (SMOTE) are applied to solve this problem and make the classes equal in size.

This paper is devoted to a supervised ML-based approach that is computationally fast and exhibits promising classification results. The rest of the paper is organized as follows. Section 2 gives the literature review. The analyzed dataset, data scaling and feature selection techniques, class balancing, and ML algorithms are presented in Section 3. The experimental results, their analysis, and discussion are provided in Section 4. Finally, in Section 5, we briefly describe all the steps taken, suggest the best ML models, and outline directions for future research.

2 Literature Review

Cyber attackers update the software they use on a daily basis, so risk detection systems must be developed just as continuously to combat malware. To this end, a large body of literature exists, and new research is being done to improve the performance of protection systems. In addition, there is a significant amount of research on identifying threats with various ML methods. Therefore, this section focuses on ML and neural network techniques for mitigating the existing threats. The research works devoted to Internet threat problems are shown in Tab. 1.

Table 1: Research works and their features

A DDoS attack consists of a large number of incoming packets that overload network resources. The server generally starts to drop packets and becomes unavailable to legitimate incoming packets for a certain period. As modern computer networks are built from many network devices such as hubs, switches, and routers, network management remains a challenging task. To overcome these difficulties, a new network approach called Software-Defined Networking (SDN) is utilized, in which the forwarding hardware is decoupled from the control decisions. In this approach, the network functionalities are centralized in software-based controllers, and network devices can be programmed through an open interface. The SDN significantly simplifies network management and makes it very efficient.

In [22], DDoS attacks in an SDN environment are evaluated with ML algorithms. The RF, k-NN, and SVM algorithms are very efficient and show accuracy, precision, recall, and F1-score values above 98%. DDoS attacks in the SDN environment are also studied in [23], where six characteristics of the switch flow table are extracted and a DDoS attack model is built with the SVM classification algorithm. In the experimental results, this model achieves an accuracy of 95.24% with a small flow. In [24], a deep learning neural network model is proposed for detecting DDoS attacks, with performance metrics such as average delay, packet loss, packet delivery ratio, and throughput. The KDD Cup, SSE, and mixed datasets are used to analyze this model’s performance, and the technique shows accuracy values of 98.9%, 99%, and 98.1% for these datasets, respectively. In [25], a real-time solution is used to detect DDoS attacks in hardware. The CAIDA DDoS, MIT DARPA, and TUIDS datasets are used to evaluate the effectiveness of the proposed method. The experimental results demonstrate a very high accuracy of 99% for all three datasets, with less than one microsecond needed to identify an incoming attack.

3 Methodology

This work performs the classification of DDoS attacks [26] in the SDN environment [27]. In the first step, the required dataset is chosen and thoroughly analyzed. In the second step, the categorical features in the dataset are encoded into numerical form. The third and fourth steps describe the data scaling used to normalize the dataset and the feature selection techniques used to choose suitable features for model training. Then the undersampling, oversampling, and SMOTE class balancing techniques, which make the classes equal in size, are explained in detail. Finally, the principles of the ML algorithms used in the experimental part are explained.

3.1 Dataset

A number of datasets containing information about various DDoS attacks [28,29] are available online, but the current work focuses on processing data for attacks on the SDN. The corresponding dataset is shared online by its authors [30]. This dataset includes benign TCP, UDP, and ICMP traffic and malicious traffic that is a collection of TCP Syn, UDP flood, and ICMP attacks. It consists of 23 features extracted from switches. The list of the features and their descriptions are presented in Tab. 2. The dataset includes 104345 rows with 63335 benign and 40504 malicious labels (Fig. 3).

Table 2: Features of the dataset

Figure 3: Benign and malicious classes

3.2 Label Encoding

In the presented dataset, some of the features, such as the source IP address, the destination IP address, and the protocol, are categorical. They must be transformed into numerical form before the scaling step is applied. Therefore, each categorical value is replaced with an integer between 0 and the number of classes minus 1.
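The encoding described above can be illustrated with a minimal pure-Python sketch. The function name `label_encode` and the sample protocol values are illustrative assumptions, not taken from the paper; Scikit-learn's `LabelEncoder` follows the same idea of mapping sorted categories to consecutive integers.

```python
def label_encode(values):
    """Map each distinct category to an integer in [0, n_classes - 1]."""
    classes = sorted(set(values))                 # deterministic category order
    mapping = {category: index for index, category in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Illustrative values only; the real column comes from the dataset.
protocols = ["TCP", "UDP", "ICMP", "TCP", "UDP"]
codes, mapping = label_encode(protocols)
print(codes)    # [1, 2, 0, 1, 2]
print(mapping)  # {'ICMP': 0, 'TCP': 1, 'UDP': 2}
```

Note that the resulting integers carry no real ordering, which matters less for tree-based models than for distance-based ones such as k-NN.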

3.3 Data Scaling

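The values of the dataset’s features are measured at different scales, so they do not contribute equally to the model fitting. If an ML model is trained with these features unchanged, the result can be biased, making the model imprecise. Normalization techniques [31] are used to deal with this problem. Mean normalization, Min-Max normalization, and Standardization are the most frequently used scaling methods.

Mean normalization is calculated by the following formula

$$x' = \frac{x - \bar{x}}{x_{\max} - x_{\min}}$$

where $x$ is the original feature value, $\bar{x}$ is the mean of the feature, and $x_{\max}$ and $x_{\min}$ are its maximum and minimum values. Min-Max normalization rescales each feature to the range [0, 1]

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

and Standardization rescales it to zero mean and unit variance

$$x' = \frac{x - \bar{x}}{\sigma}$$

where $\sigma$ is the standard deviation of the feature.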
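The three scaling methods named above can be sketched with the Python standard library. The function names and the sample values are assumptions made for illustration; the sketch also assumes that the feature is not constant, since a constant feature would cause a division by zero.

```python
from statistics import mean, pstdev


def min_max(xs):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def mean_normalize(xs):
    """Center values on the mean and divide by the value range."""
    mu, lo, hi = mean(xs), min(xs), max(xs)
    return [(x - mu) / (hi - lo) for x in xs]


def standardize(xs):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


# Hypothetical feature column, e.g. a byte counter per flow.
byte_counts = [10.0, 20.0, 30.0, 40.0, 50.0]
print(min_max(byte_counts))         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(mean_normalize(byte_counts))  # [-0.5, -0.25, 0.0, 0.25, 0.5]
```

In practice the minimum, maximum, mean, and standard deviation would be estimated on the training part only and then reused for the test part.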

3.4 Feature Selection Techniques

Building an ML model rarely requires the use of all features. Adding redundant features to the model increases its complexity, cost, and running time. Therefore, feature selection techniques [32] such as the Information gain (IG), the Chi-square, and the F-test are used to overcome this problem.

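The IG measures the dependence between each feature and the target feature. It is given by the following formula

$$IG(Y, X) = H(Y) - H(Y \mid X)$$

where $H(Y)$ is the entropy of the target feature $Y$, and $H(Y \mid X)$ is the conditional entropy of $Y$ given the feature $X$.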
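As a hedged illustration of the IG, the sketch below computes the entropy-based information gain of a discrete feature with respect to the class labels. The function names and the toy data are assumptions; for continuous features, a library estimator such as Scikit-learn's `mutual_info_classif` would normally be used instead of this direct count-based form.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]   # fully determines the label
uninformative = ["a", "b", "a", "b"]  # independent of the label
ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
```

Here the informative feature yields a gain of one bit and the uninformative feature a gain of zero, so a ranking by IG would keep the first and drop the second.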

The Chi-square test is used for testing the independence of two events. Given two features, the observed count O and the expected count E are obtained, and the Chi-square statistic estimates how much O and E deviate from each other.

$$\chi_c^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O is the observed values, E is the expected values, and c is the degrees of freedom.
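A minimal sketch of the Chi-square statistic for a contingency table of a categorical feature against the class label is given below. The table values are invented for illustration, and the expected counts are derived from the row and column totals under the independence assumption.

```python
def chi_square(table):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the class were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: two values of a hypothetical feature; columns: benign, malicious.
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]
chi_dependent = chi_square(dependent)      # 20.0
chi_independent = chi_square(independent)  # 0.0
```

For a table with r rows and k columns, the degrees of freedom are (r - 1)(k - 1); a larger statistic indicates a stronger dependence between the feature and the class.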

The F-test is a statistical test that computes the ratio between variance values. The results of the test are used for feature selection. The F-test is calculated by the formula

$$F = \frac{MST}{MSE}$$

where MST is the mean square of treatments, and MSE is the mean square error.
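The ratio can be illustrated with a one-way ANOVA F-statistic computed for a single feature grouped by class. This is a sketch under the assumption that the F-test is applied per feature in this way (as Scikit-learn's `f_classif` does); the sample values are invented.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F-statistic: between-group over within-group variance."""
    k = len(groups)                                # number of classes
    n = sum(len(g) for g in groups)                # total number of samples
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    mst = ss_between / (k - 1)                     # mean square of treatments
    mse = ss_within / (n - k)                      # mean square error
    return mst / mse


# Values of one hypothetical feature, split by class label.
benign_values = [1.0, 2.0, 3.0]
malicious_values = [4.0, 5.0, 6.0]
f_value = f_statistic([benign_values, malicious_values])  # 13.5
```

A feature whose class means differ strongly relative to the spread inside each class gets a high F value and is ranked as more important.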

3.5 Class Balancing

Class imbalance means that the number of elements in one class is higher than in the other class. It is commonly a big problem in ML because a model can obtain a misleadingly high accuracy by simply labeling all elements as the majority class. Such a model performs weakly in classifying the other class, and the values of precision, recall, and F1-score become lower than the accuracy. Class imbalance generally appears in such domains as spam classification, fraud detection, disease screening, and DDoS attack detection. Three effective class balancing techniques, called random oversampling, random undersampling (Fig. 4), and synthetic minority oversampling (SMOTE), are used to overcome this problem [33].

Figure 4: Class balancing: random undersampling and random oversampling

In random oversampling, elements from the minority class are randomly selected for duplication to make this class equal in size to the majority class. In random undersampling, the opposite operation is done: random elements of the majority class are deleted to reduce its size to that of the minority class. The disadvantage of undersampling is the loss of a large part of valuable data. In random oversampling, by contrast, all the information is kept.
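Both random resampling strategies can be sketched as follows. The function names, the fixed seed, and the toy data are assumptions for illustration; the Imbalanced-learn library provides production implementations of the same ideas.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced data: seven samples of class 0 and three of class 1.
X = [[i] for i in range(10)]
y = [0] * 7 + [1] * 3
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # Counter({0: 7, 1: 7})
print(Counter(y_under))  # Counter({0: 3, 1: 3})
```

The output shows the trade-off described in the text: oversampling keeps all ten original rows, while undersampling discards four majority rows.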

SMOTE is another very efficient oversampling technique, in which synthetic elements are generated for the minority class. The algorithm works in the feature space, synthesizing each new element between an existing minority element and one of its nearest minority neighbors. The created elements therefore preserve the valuable information of the minority class.
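The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the reference algorithm: the function name `smote_sketch`, the neighbor count `k`, the seed, and the toy points are all assumptions.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points between minority samples and neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the chosen base point.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the segment between base and neighbor
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
```

Every synthetic point lies on a line segment between two existing minority points, so the new elements stay inside the region already occupied by the minority class instead of being exact duplicates.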

3.6 Machine Learning Algorithms

The ML classification has been implemented with the following algorithms: NB, SVM, logistic regression (LR), k-NN, DT, RF, XGBoost, and CatBoost. These algorithms were chosen because they are considered advanced and are widely used for data classification tasks.

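An NB classifier [8] uses Bayes’ theorem as a probabilistic model for classification. It relies on the important assumption that the features are independent, which is why the algorithm is called naïve. The Bayes formula is written below

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

where $X = (x_1, x_2, x_3, \ldots, x_n)$, and $x_1, x_2, x_3, \ldots, x_n$ is the list of features of the dataset. Expanding with the chain rule under the independence assumption gives the following formula

$$P(y \mid x_1, \ldots, x_n) = \frac{P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)\,P(y)}{P(x_1, \ldots, x_n)}$$

As the denominator does not change for all entries in the dataset, it can be removed, so that the predicted class is

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$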
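The decision rule without the denominator can be sketched for categorical features as follows. The function names, the add-one (Laplace) smoothing, and the toy traffic records are assumptions introduced for illustration; the paper does not state which NB variant was used.

```python
from collections import Counter, defaultdict


def train_nb(X, y):
    """Count class frequencies and per-class feature value frequencies."""
    priors = Counter(y)
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)      # feature index -> distinct values seen
    for row, label in zip(X, y):
        for i, v in enumerate(row):
            counts[(label, i)][v] += 1
            values[i].add(v)
    return priors, counts, values


def predict_nb(model, row):
    """Return the class maximizing P(y) * prod_i P(x_i | y)."""
    priors, counts, values = model
    total = sum(priors.values())
    scores = {}
    for label, n_label in priors.items():
        score = n_label / total
        for i, v in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values.
            score *= (counts[(label, i)][v] + 1) / (n_label + len(values[i]))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical records: (protocol, packet-rate level).
X = [["TCP", "high"], ["TCP", "high"], ["UDP", "low"], ["ICMP", "low"]]
y = ["malicious", "malicious", "benign", "benign"]
model = train_nb(X, y)
print(predict_nb(model, ["TCP", "high"]))  # malicious
print(predict_nb(model, ["UDP", "low"]))   # benign
```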

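An SVM classifier [11] defines a hyperplane that divides the input data into classes. The main objective is to find the hyperplane with the maximum distance to the nearest data points of the two classes (Fig. 5). The hyperplane is defined by the following formula

$$w \cdot x + b = 0$$

where w is the weight vector normal to the hyperplane, x is a feature vector, and b is the bias term.

Figure 5: The hyperplane separating classes

A k-NN classifier assigns a new point to the class that is most common among its k nearest neighbors in the training data. The distance between two points is commonly measured by the Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where d(x, y) is the distance between two points; $x_i$ and $y_i$ are the i-th components of the feature vectors of the points x and y, respectively; n is the length of the feature vector.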
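As a minimal sketch of how a distance d(x, y) of this kind is used by a k-NN classifier, the code below assumes the Euclidean distance and toy two-dimensional points; neither the function name nor the data comes from the paper.

```python
from collections import Counter
from math import dist


def knn_predict(X_train, y_train, point, k=3):
    """Classify a point by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(X_train, y_train),
        key=lambda pair: dist(pair[0], point),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters labeled 0 and 1.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.5, 0.5]))  # 0
print(knn_predict(X_train, y_train, [5.5, 5.5]))  # 1
```

Because the vote depends directly on distances, k-NN is sensitive to feature scales, which is one reason the scaling step precedes classification.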

A DT is a popular and widely used ML algorithm for data classification. A DT is a structure with N nodes, each containing a condition on the features of the points in the dataset. At each node, the points whose feature values satisfy the condition are sent to one branch of the tree, and the remaining points are sent to the other branch. This process continues through the whole tree until the leaf nodes are reached. An RF (Fig. 6) is an ensemble method of DTs [19]. Each DT classifies a new data point independently, and the class is defined by the largest number of votes of all trees in the ensemble.
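The voting step of an RF can be illustrated with three hand-written one-condition trees. This sketch only shows the majority vote; the feature names (`pktcount`, `byte_rate`, `flows`) and thresholds are hypothetical, and a real RF learns its trees from bootstrap samples of the training data.

```python
from collections import Counter


def forest_predict(trees, point):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(point) for tree in trees)
    return votes.most_common(1)[0][0]


# Each "tree" is a single condition on one hypothetical feature.
trees = [
    lambda p: "malicious" if p["pktcount"] > 1000 else "benign",
    lambda p: "malicious" if p["byte_rate"] > 5e5 else "benign",
    lambda p: "malicious" if p["flows"] > 50 else "benign",
]

flow = {"pktcount": 5000, "byte_rate": 1e4, "flows": 80}
print(forest_predict(trees, flow))  # malicious (two votes against one)
```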

Figure 6: An RF classifier

XGBoost [33], released in 2014, is one of the most advanced ML algorithms. It provides parallel tree boosting and is highly efficient in performance. One feature of the algorithm is that the deviations of the ensemble predictions from the true values are calculated at each iteration. CatBoost was developed by Yandex in 2017. This algorithm is also based on gradient boosting and focuses on categorical features in a dataset. It is also very fast and effective, and it allows GPU usage during the training step.

4 Experiments and Discussion

The experimental part utilized the Python programming language with the Scikit-learn, Imbalanced-learn, Matplotlib, and Seaborn libraries.

First, the categorical features of the dataset were encoded with the label encoding technique. Then all features were scaled with the Min-Max normalization. The IG, Chi-square, and F-test feature selection techniques were applied to the dataset to obtain the ten most important features. The dataset was balanced with the undersampling, oversampling, and SMOTE techniques, randomly divided into training (70%) and testing (30%) parts, and classified with the eight ML algorithms from Section 3.6. The hold-out split method was chosen instead of k-fold cross-validation because the classification with three feature selection techniques, four balancing methods, and eight ML algorithms requires much time. Cross-validation would take a considerable amount of time to run all the experiments and to combine all the obtained data.
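The size of this experimental grid and the hold-out split can be sketched as follows. The list names, the seed, and the helper `hold_out_split` are assumptions for illustration; "imbalanced" stands for the dataset without balancing, which the text counts as the fourth balancing option.

```python
import random
from itertools import product

FEATURE_SELECTORS = ["IG", "Chi-square", "F-test"]
BALANCERS = ["imbalanced", "undersampling", "oversampling", "SMOTE"]
CLASSIFIERS = ["NB", "SVM", "LR", "k-NN", "DT", "RF", "XGBoost", "CatBoost"]


def hold_out_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Every combination corresponds to one model that must be trained.
grid = list(product(FEATURE_SELECTORS, BALANCERS, CLASSIFIERS))
train_rows, test_rows = hold_out_split(list(range(100)))
print(len(grid))                         # 96
print(len(train_rows), len(test_rows))   # 70 30
```

With a single hold-out split, 96 models are trained; a k-fold scheme would multiply this number by k (for example, 480 fits for k = 5), which is the cost the authors chose to avoid.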

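The performance was evaluated by the accuracy, precision, recall, and F1-score measures [26].

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP (true positive) is a correctly classified positive instance; TN (true negative) is a correctly classified negative instance; FP (false positive) is a negative instance wrongly classified as positive; FN (false negative) is a positive instance wrongly classified as negative.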
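The four measures can be computed directly from the confusion counts, as in the sketch below. The function name and the counts are invented for illustration and are not results from the paper; the sketch assumes that the denominators are non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical confusion counts for a test set of 200 flows.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy is 0.85 while the recall is 0.80, which illustrates why reporting accuracy alone can hide missed attacks.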

The classification of the imbalanced and oversampled datasets with the Chi-square, IG, and F-test feature selection methods is presented in Tab. 3.

Table 3: Classification of the imbalanced and oversampled datasets

The experimental results showed that the IG and F-test feature selection methods achieved the best performance metrics with all ML algorithms. In addition, the oversampled dataset gave better results with the SVM, LR, and NB algorithms than the imbalanced dataset. Among the ML models [27], DT, RF, XGBoost, and CatBoost demonstrated significantly better results than NB, SVM, and LR. In most experiments, their accuracy, precision, recall, and F1-score values reached 0.99 and 1.00.

The classification of the undersampled and SMOTE datasets with the Chi-square, IG, and F-test feature selection methods is presented in Tab. 4.

Table 4: Classification of the undersampled and SMOTE datasets

The classification results for the undersampled and SMOTE datasets showed that the IG and F-test feature selection techniques achieved better results than the Chi-square feature selection. The DT, RF, k-NN, CatBoost, and XGBoost algorithms also classified these datasets better than the other algorithms. The experiments support the statement that these algorithms are the most advanced in classifying Internet threats.

Overall, the experimental results revealed that classification of the imbalanced data showed the lowest performance with all three feature selection techniques compared to models trained on the oversampled, undersampled, and SMOTE datasets.

5 Conclusion

As the number of different threats grows and the earlier methods of system protection become less effective and more vulnerable to various attacks, a need for more advanced methods has appeared. One of the most serious threats is the DDoS attack [28], which disables the normal operation of servers, networks, and systems. This work studied DDoS attacks in the SDN network [29], where all the functionalities are centralized in software-based controllers. DDoS attacks are especially dangerous for the SDN network [30,34], and an effective approach is required to detect them accurately and quickly. ML algorithms proved to be very useful in revealing malicious traffic in these kinds of networks.

The dataset containing benign and malicious traffic instances was processed and analyzed in the experiments. The IG, Chi-square, and F-test feature selection methods retrieved the most important features. Then three balancing techniques were used to balance the classes, and eight very efficient ML algorithms (NB, SVM, LR, k-NN, DT, RF, XGBoost, and CatBoost) were applied to train the classification models. The classification performance was evaluated by the accuracy, precision, recall, and F1-score measures. The DT, RF, k-NN, XGBoost, and CatBoost algorithms showed the best results with all feature selection and class balancing techniques, with accuracy, precision, recall, and F1-score values of 0.99 and 1.00.

In future work, the ML classifiers will be tested on datasets containing different kinds of Internet threats such as smurf, phishing, man-in-the-middle, SQL injection, password attacks, and others.

Acknowledgement: We would like to thank colleagues for their support.

Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.