Deep Learning-Based Hybrid Intelligent Intrusion Detection System

2021-12-14 09:59, Muhammad Ashfaq Khan and Yangwoo Kim
Computers, Materials & Continua, 2021, Issue 7

Muhammad Ashfaq Khan and Yangwoo Kim

1 Department of Information and Communication Engineering, Dongguk University, Seoul, 100-715, Korea

2 Department of Electronics Engineering, IoT and Big-Data Research Center, Incheon National University, Incheon, Korea

Abstract: Machine learning (ML) algorithms are often used to design intrusion detection (ID) systems that detect and mitigate malicious cyber threats at the host and network levels. However, cybersecurity attacks are still increasing, and an ID system can play a vital role in detecting them. Existing ID systems often fail to detect malicious threats, primarily because they rely on traditional ML techniques that pay little attention to accurate classification and feature selection. Thus, developing an accurate and intelligent ID system is a priority. The main objective of this study was to develop a hybrid intelligent intrusion detection system (HIIDS) that learns crucial feature representations efficiently and automatically from massive unlabeled raw network traffic data. Many ID datasets are publicly available to the cybersecurity research community. We used robust Spark MLlib (machine learning library) classifiers, namely logistic regression (LR) and extreme gradient boosting (XGB), for anomaly detection, and a state-of-the-art deep learning (DL) model, a long short-term memory autoencoder (LSTMAE), for misuse detection, to build an efficient HIIDS that detects and classifies unpredictable attacks. Our approach uses the LSTM to capture temporal features and the AE to capture global features more efficiently. To evaluate the efficacy of the proposed approach, experiments were conducted on a publicly available dataset, the contemporary real-life ISCX-UNB dataset. The simulation results demonstrate that the proposed Spark MLlib and LSTMAE-based HIIDS significantly outperformed existing ID approaches, achieving an accuracy of up to 97.52% on the ISCX-UNB dataset under 10-fold cross-validation. The proposed HIIDS is therefore promising for large-scale use in real-world settings.

Keywords: Machine learning; intrusion detection system; deep learning; Spark MLlib; LSTM; big data

1 Introduction

The intrusion detection (ID) system is a well-known solution for detecting malicious activities in a network. The types of malicious network attacks have grown exponentially, and the ID system has become an essential component of defense within the network security infrastructure. In 1980, James P. Anderson published the first significant paper on ID, Computer Security Threat Monitoring and Surveillance [1]. An ID system usually monitors all internal and external packets of a network to detect whether a packet shows signs of intrusion. A well-designed ID system can identify the properties of numerous malicious activities and automatically respond to them by raising alerts.

In general, there are three common classes of ID system, distinguished by detection approach. The first class is the signature-based system (SBS), which includes the misuse detection technique. The second is the anomaly-based system (ABS), also known simply as “anomaly.” The third is stateful protocol analysis detection [2]. SBS relies on pattern matching: it takes a database of known attack signatures and compares them with signatures present in the observed data. An alarm goes off when a match is identified. Because SBS detects attacks based on existing knowledge, the misuse detection technique is also known as a knowledge-based technique. Misuse detection offers a low false alarm rate and high accuracy; it cannot, however, identify previously unseen attacks. The behavior-based ID system, also known as ABS, detects intrusions by comparing observed behavior against a profile of normal behavior. The stateful protocol ID method compares observed protocol activity against known malicious activities and identifies deviations in protocol behavior, taking advantage of both anomaly- and signature-based ID techniques. ID systems can be further categorized into three types according to their architectures: the network-based detection system (NIDS), the host-based detection system (HIDS), and the hybrid approach [3]. For a HIDS, the software is installed on the host computer, which evaluates and monitors system behavior, and event log files play an active role in ID [4]. Unlike a HIDS, which analyzes each host separately, a NIDS analyzes the packets that flow over the network. This gives the NIDS an edge over the HIDS because it can monitor the whole network with a single system. However, while the NIDS is superior in terms of installation cost and deployment time, it is vulnerable to intrusions that spread through the network and affect the complete network. The hybrid IDS combines the HIDS and NIDS to provide stronger security mechanisms. The hybrid system combines spatially distributed sensors to identify vulnerabilities that can occur at a particular point or over the whole network. ID systems are also divided into two types according to their deployment structure: distributed and non-distributed. A distributed structure involves several ID subsystems that communicate with each other over an extensive network. In contrast, a non-distributed system is installed at a single location; the open-source Snort is an example.

Most approaches currently used in ID systems are unable to deal with the complex and dynamic nature of malicious threats on computer networks. Effective adaptive methods, such as ML techniques, can therefore help achieve a higher intrusion detection rate (DR) and a lower false alarm rate (FAR) at reasonable communication and computation costs. There are various customary approaches to protecting a system, including access control, cryptography, and firewalls. These traditional techniques have limitations in fully protecting a system; most notably, when systems face a high volume of malicious attacks such as denial of service (DoS), they can produce high false positive (FP) and false negative (FN) rates. Recently, numerous researchers have used ML techniques to improve ID rates, and several studies have sought to enhance and apply these methods in ID systems. In the present study, we also reviewed several articles that use state-of-the-art ML methods for ID. The ML models were found to have issues that slow down the training process, including the size of the dataset and the search for optimal parameters for the most suitable model. These problems prompted researchers to look for a more effective methodology. The use of open-source cluster computing, such as Spark (a big data analytics system), is one potential solution to such problems. We propose a novel approach to improve ID system performance. However, simple ML approaches are limited, while intrusion methods are expanding and growing increasingly complex. Advanced learning approaches are essential, especially for intrusion analysis and feature extraction. Hinton et al. [5] stated that DL has attained great achievements in domains such as image processing, natural language processing (NLP), and weather prediction.

As cybersecurity capabilities have grown, cyber-attacks have evolved to breach new security defenses. Considering the possibility of simultaneous attacks, it is vital to select an appropriate action by proactively predicting and evaluating the effects of a specific security event. Cyber-attacks, particularly against large-scale military networks, can have a lethal influence on security; therefore, numerous tests and extensive research are required in preparation. Today, cyberspace is known as the fifth battlespace, following land, air, space, and sea; cyber warfare can affect the military strategies and activities associated with national security. Although the military is working to recognize and minimize cyber-attacks, they are consistently on the rise [6,7]. To respond effectively, it is important to observe how malicious threats arise in their distinct forms. The significance of cyber-attacks for the infrastructure to be protected should be assessed, and security policies should be established. The aim is not only to analyze cyber threats but also to enable more proactive responses. Numerous studies have been carried out on cyber-attack modeling, such as the attack graph, attack tree, and cyber kill chain modeling approaches [8,9]. Previous work on cyber-attack modeling was, however, limited in large-scale network environments by problems such as scalability. Nowadays, cyber threats do not stop with a single attack but come in complex forms that involve numerous kinds of cyber-attacks. Moreover, novel attacks are constantly emerging. To overcome these challenges, a novel modeling approach that is flexible enough to adapt to new attacks easily and systematically is required [10].

As mentioned above, misuse and anomaly ID methods each have their limitations. Our proposed HIID approach combines the two to overcome their respective shortcomings while retaining their advantages, which yields improved performance compared to conventional techniques. To increase IDS learning ability and performance, we propose an improved ID system that consists of Spark MLlib and a state-of-the-art DL approach, the LSTMAE. The key contributions of our research may be summarized as follows:

• The development of HIIDS, which relies on Spark MLlib and a state-of-the-art DL technique, the LSTMAE, and merges shallow and deep networks to overcome their analytical overheads and exploit their benefits. This HIIDS also investigates how to solve the class imbalance problem that usually occurs in ISCX ID datasets.

• Further investigation of the packet capture file directly on Spark; prior studies did not evaluate the raw packet dataset.

• Comparison of the HIIDS with other conventional ML methods. The simulation results demonstrate that the HIIDS approach is highly appropriate for malicious traffic detection. It has higher attack detection accuracy and correctly detected network misuse in 97.52% of cases under 10-fold cross-validation.

The rest of this paper is organized as follows. The background of ID and related work are briefly reviewed in Section 2. A brief overview of our proposed HIIDS and a detailed description of the dataset that was used for classification are provided in Section 3. A simulation of our proposed framework with performance metrics is discussed in Section 4. The paper is concluded with a possible direction for future work described in Section 5.

2 Related Work

Over the last two decades, several researchers have suggested applying machine learning (ML) and deep learning (DL) to intrusion detection (ID) systems. Accordingly, various models have been developed for network intrusion detection (NID) using conventional ML techniques. Examples include K-nearest neighbors (KNN), as suggested by Khammassi et al. [11]; logistic regression (LR), by Moustafa et al. [12]; support vector machine (SVM), by Khan et al. [13]; random forest (RF), by Farnaaz et al. [14]; decision tree (DT), by Sindhu et al. [15]; naïve Bayes (NB), by Buczak et al. [16]; and artificial neural networks (ANN), by Vincent et al. [17]. However, these prior techniques demonstrate inadequate classification performance, with a high false alarm rate (FAR) and a low attack detection rate (DR). Kim et al. [18] developed a hybrid system that incorporates misuse and anomaly detection using the supervised ML classifiers SVM and DT, respectively, and assessed their hybrid approach using the older NSL_KDD data. The authors attributed the improved attack detection accuracy to the hybrid ID system. Paulauskas et al. [19] developed an ensemble approach for ID that combines various weak learners, each of which has low detection accuracy on its own; the weak learners used by the authors included J48, C5.0, naïve Bayes, and rule-based classifiers. Zaman et al. [20] used an improved ID algorithm known as the enhanced support vector decision function (ESVDF) and evaluated their proposed IDS using the DARPA dataset; the proposed IDS was found to be superior to other conventional ID approaches.

Table 1: Summary of the related works using different approaches

Although the above ID approaches have demonstrated decent accuracy up to a certain level, improvements such as lowering the FAR and increasing the ID accuracy are still necessary. In this regard, DL is a powerful technique. DL is a branch of ML that has become progressively dominant in various fields, such as speech recognition and natural language processing (NLP). DL’s popularity is due to two fundamental characteristics: (a) hierarchical feature representations and (b) the handling of long-term dependencies in sequential patterns. Today, state-of-the-art DL approaches used for NID include autoencoders (AE), deep belief networks (DBNs), deep neural networks (DNNs), and restricted Boltzmann machines (RBMs), as well as variants of these approaches. An overview of the state-of-the-art approaches is presented in Tab. 1.

DL has been shown to exceed conventional approaches in attack detection accuracy in the ID domain [31]. Erfani et al. [32] developed a novel approach that combined a one-class linear SVM with a DBN for ID and evaluated it on various benchmark ID datasets. Fiore et al. [33] proposed a discriminative RBM to learn compressed attributes from the data; these compressed attributes are then fed into a softmax classifier for binary classification of benign and malicious network behavior. Wang et al. [34] presented a DL-based IDS based on an AE for detecting network traffic from the raw dataset and achieved very high ID performance. Javaid et al. [35] used a DNN for anomaly detection; their evaluation found the DNN to be a novel and effective approach for ID in software-defined networks (SDNs). Yin et al. [36] introduced a neural network (NN)-based DL NIDS. Tested on the NSL_KDD dataset, it outperformed conventional ML-based classification techniques. Khan et al. [37] presented a hybrid DL approach for ID and applied it to real-time ID data. Their simulation outcomes showed that the hybrid DL-based IDS was superior in terms of attack classification accuracy and performance. Alrawashdeh et al. [38] proposed a DL-based IDS using a DBN of RBMs with four and one hidden layers for attribute reduction; the weights of the DBN were updated during fine-tuning, and attack classification was accomplished using an LR classifier. The methodology was evaluated on the benchmark KDD99 data and attained an attack classification accuracy of up to 97.9% with a FAR of 0.5%. However, accuracy measured on this dated dataset is not sufficient to show that this is a robust approach for NID. Shone et al. [39] proposed a non-symmetric deep AE-based ID and evaluated the proposed framework with the benchmark KDD99 dataset, achieving an attack classification accuracy of 97.87% and a FAR of 2.15%. In [40], the authors built a deep neural network (DNN) with 100 hidden units, trained on a GPU with the KDD99 dataset to improve performance. The authors suggested that both recurrent neural network (RNN) and long short-term memory (LSTM) models are better for enhancing attack detection accuracy. These DL-based ID systems were found to be superior to traditional approaches; various authors have also combined DL and ML techniques, with the primary goal of developing an efficient and robust ID system. Wang et al. [41] developed a novel approach for ID by combining fuzzy clustering and an ANN; they tested this hybrid approach on the KDD99 dataset and demonstrated that FC-ANN outperforms traditional ML approaches in terms of ID. Mukkamala et al. [42] used a hybrid approach combining SVM and ANN and evaluated it on the benchmark KDD99 dataset. Here, SVM and ANN were used for classification tasks and data patterns, respectively. Various researchers have used the ISCX-2012 ID dataset for system validation. However, there is still much room for enhancement, such as improving attack detection accuracy and reducing FAR [43–48]. Various scholars have developed both ABS and SBS using separate classification methods. Used separately, these methods do not detect attacks efficiently, so a hybrid ID system remains an important research challenge. ML-based techniques have mostly been used to develop an ABS, which builds a model by comparing normal with abnormal behavior and then attempts to classify incoming packets as “attack” or “normal.” DL is enormously valuable for the ID system because it automatically extracts features of the specific problem without requiring extensive prior knowledge. The main downside of using a DL model in the ID domain is the length of training; obtaining the right model is time-consuming.

The issue of class imbalance has drawn substantial attention from the research community [49]. Class imbalance arises from a skewed data distribution: one class contains most of the samples, while the others contain comparatively few. The classification problem becomes more complicated as data dimensionality increases, due to unbounded data values and unbalanced classes. Bedi et al. [50] utilized numerous ML approaches to deal with the class imbalance issue. Thabtah et al. [51] also evaluated various approaches to the class imbalance problem. Most algorithms focus on the majority samples while missing the minority samples, even though minority samples appear irregularly but constantly. The main remedies for the unbalanced data problem are data preprocessing and feature selection techniques, and every approach has both benefits and shortcomings. The ID dataset has a high-dimensional imbalance problem that includes missing features of interest, missing feature values, or the sole existence of aggregate data. The data appear to be noisy, containing errors and outliers, and inconsistent, comprising discrepancies in codes or names. We used over-sampling to resolve the imbalance; this involved enlarging the number of instances in the minority class by randomly replicating them to increase the presence of the minority class in the sample. Although this procedure carries some risk of overfitting, no information was lost, and the over-sampling approach was found to outperform the under-sampling alternative.
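The random over-sampling described above can be illustrated with a minimal Python sketch. It is not the authors' Spark implementation: the function name `random_oversample`, the toy feature vectors, and the choice to balance every class up to the size of the largest one are assumptions made for illustration only.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Replicate randomly chosen minority-class samples until every
    class has as many samples as the largest class (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw replicas with replacement from the existing samples of this class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical toy data: five "normal" flows and one "attack" flow.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
y = ["normal"] * 5 + ["attack"]
X_bal, y_bal = random_oversample(X, y)
counts = Counter(y_bal)
```

Because the replicas are exact copies, no original record is discarded, which matches the "no information was lost" remark; it is usually safer to over-sample the training split only, so that duplicated records do not leak into the evaluation data.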

With the accelerated growth of big data, DL approaches have flourished and have been widely utilized in numerous domains. In contrast to previous studies, we took a hybrid approach, the Anomaly-Misuse ID method, with two-stage classification to overcome the limitations faced by separate classification methods. We used Spark MLlib and the LSTMAE DL approach for ID on the well-known contemporary real-time dataset ISCX-2012.

3 Proposed Approach

Fig. 1 presents the proposed ID framework, which comprises two learning stages. For this HIIDS, we constructed a two-stage ID system in which Spark MLlib performs anomaly detection in Stage-1 and the LSTMAE performs misuse detection in Stage-2.

This two-stage ID framework is efficient in terms of computational complexity while using the full-feature dataset and offers higher accuracy with a low FAR.

Stage-1 Anomaly detection using Spark MLlib classifiers.

Stage-2 Misuse detection using state-of-the-art deep learning approaches such as LSTMAE.

3.1 Overview of the Proposed IDS

The hybrid framework concentrates on resolving real-time ID problems using big data analysis platforms (Apache Spark and Apache Hadoop) and AI (ML and DL). Handling this problem is complex due to space and time restrictions. Big data is enormous and consistently increasing in volume, and processing it requires prohibitive amounts of power, specialized resources, and computational devices that can handle the data effectively. The hybrid ID framework overcomes these problems by using MLlib together with the LSTMAE. The main structure of the HIIDS presented here, which combines Spark MLlib and deep learning, forms the basis of the experiment.

Numerous ML techniques were used due to the huge volume of data. We selected Spark to implement the logistic regression (LR) and extreme gradient boosting (XGB) classifiers. Initially, the preprocessed data was passed through these machine learning classifiers to produce regression models that cover all of the data. This is the binary learning phase. In this hybrid ID approach, the NIDS uses both anomaly and misuse techniques. The proposed hybrid ID architecture contains a data preprocessing module, a Spark MLlib classification component that implements the anomaly detection module (Stage-1), a DL classification component for misuse detection (Stage-2), and an alarm module.

Stage-1 used Spark MLlib to detect anomalies that may be intrusions, and Stage-2 used the LSTMAE DL model to further classify attacks when they occur. The details of the proposed Spark MLlib and LSTMAE model are shown in Fig. 1.

Figure 1: The micro overview of the proposed ID framework

The architecture of the hybrid IDS is shown in Fig. 1. Network traffic was first arranged and preprocessed. During preprocessing, all necessary conversions were made for both the Stage-1 Spark MLlib and Stage-2 LSTMAE modules of the HIIDS, since each stage has its own supported data format. For our hybrid ID experiment, we used 1,512,000 network traffic packets obtained from the ISCX-2012 dataset to demonstrate the effectiveness of the proposed HIIDS.

3.2 Datasets

Choosing a suitable ID dataset plays a significant role in testing the ID system; therefore, the dataset for the simulation of the proposed HIID approach was chosen carefully.

3.2.1 Explanation of the ID Dataset

There are various standard datasets, and some of them comprise inflexible, outdated, and irreproducible attacks. To overcome these shortcomings and produce more up-to-date traffic patterns, the ISCX-2012 data was created by the Canadian Institute for Cybersecurity [52]. It contains various types of ID data for assessing anomaly ID approaches. The ISCX-2012 data shows real network activities and includes numerous attack scenarios. Additionally, it is shared as a complete network capture, including all internal network traffic, so that payloads can be assessed in network packet analysis. The ISCX-2012 ID data comprises both malicious and normal traffic over seven consecutive days. The data was generated from profiles containing abstract representations of the behaviors and activities found in network traces.

For example, communication between a source and destination host over HTTP can be represented by the packets sent and received, endpoint attributes, and other similar features. This representation forms a single profile. The profiles were used to produce realistic network traffic for the POP3, SSH, FTP, HTTP, SMTP, and IMAP protocols.

The ISCX-2012 uses two distinct kinds of profile to generate network traffic activities and states. Multi-stage attack states are described by an α profile, while feature characterization and the mathematical distributions of the traffic are captured by the β profile. For example, a β profile can comprise packet size distributions, explicit patterns, and the request time distribution of a protocol, whereas an α profile is created from knowledge of sophisticated earlier attacks for a distinct day. The α profiles cover four kinds of attack scenarios.

(1) Internal infiltration of the network

Internal infiltration of a network generally takes advantage of a vulnerable application program, such as Adobe Acrobat Reader. After successful penetration, a backdoor can be run on the victim’s machine and used to execute several malicious attacks on the victim’s network. For these kinds of malicious threats, Nmap and port scans were mostly applied.

(2) HTTP DoS attacks

The attacker makes a network resource unavailable for a particular time. This is typically done by overwhelming the resource with superfluous requests that overload it and impede the fulfillment of some or all legitimate requests. To collect traffic for these kinds of DoS attacks, the Slow HTTP test, Hulk, Slowloris, and GoldenEye tools were mostly utilized.

(3) DDoS using IRC botnet

These attacks generally occur when multiple systems flood the bandwidth or resources of a particular victim. A DDoS attack is therefore often the result of many infected machines (for example, a botnet) that flood the target network with massive traffic. These attacks utilized LOIC for UDP, TCP, and HTTP.

(4) Brute force SSH

This is the most common type of attack that can be used not only to crack passwords but also to determine secret content and pages of several web applications.These types of attacks have been launched via FTP and SSH Patator tools.

The full ISCX-2012 dataset is summarized in Tab. 2. Each attack scenario was carried out on a single day, while two days consisted of normal traffic only. The variety of normal behavior and the complexity of the malicious attack scenarios in the network have been described previously [52].

3.2.2 Data Preparation and Feature Engineering

The raw network traffic dump was first preprocessed and prepared, as shown in Fig. 1. After preprocessing, the ISCX-2012 dataset comprised seven consecutive days of systematic and realistic conditions reflecting network attacks. The flows were labeled as malicious or benign, with totals of 68,792 and 2,381,532 records in the respective classes. The original traffic data were thus divided into two classes: benign/normal and abnormal/malicious.

Table 2: Daily traffic ISCX-IDS 2012 dataset summary

Furthermore, several multi-stage malicious intrusion scenarios were executed to generate various attack traces (e.g., HTTP DoS, brute force SSH, infiltration from the interior, DDoS via an IRC botnet). The detailed training and testing data distributions are presented in Tabs. 3 and 4.

Table 3: Testing and training data distribution of ISCX-2012

The core idea of this research was to evaluate the reliability of the hybrid system against anomalies and unknown attacks via the misuse approach. Tab. 4 presents the training and testing network traffic data for misuse attack detection using a state-of-the-art DL approach, the LSTMAE.

Table 4: ISCX-2012 data distribution for stage-2

3.3 Implementation Details

The performance of the HIIDS was evaluated through experiments on the ISCX-2012 ID dataset in terms of normal and attack classification: false positives, false negatives, true positives, attack detection precision, and error rate. To show the efficacy of our proposed ID system, we implemented the first stage, anomaly detection, in Scala with Spark MLlib; the second stage, misuse detection with the DL approach, was implemented in Java with Deeplearning4j. The simulation was run on a 64-bit cluster computer with 32 cores, 32 GB RAM, and Ubuntu 14.04. The software stack comprised Java (JDK) 1.8, Spark v2.3.0, Deeplearning4j 1.0.0.alpha, and Scala 2.11.8. The deep learning model was trained on an RTX 2080 Ti GPU, with CUDA and cuDNN used to speed up the pipeline.

To measure HIIDS performance, we first split the dataset into training and test sets. We built the HIID framework on the training data and evaluated the hybrid approach on the test data. The block diagram of the proposed HIID is presented in Fig. 1. The ISCX-2012 dataset with its complete, original features was used to demonstrate the performance of the proposed hybrid approach. The network traffic, a mix of malicious and normal flows, passes through the Spark MLlib Stage-1, which categorizes it into malicious and normal classes. The Stage-2 LSTMAE was modeled on malicious traffic, which it further categorizes into four corresponding attack classes. The hybrid approach overcomes the computational complexity of applying the full feature set of the ISCX-2012 dataset while achieving higher ID accuracy and a low FAR. In total, 80% of the data was used for training with 10-fold cross-validation, and the model was evaluated on the remaining 20% held-out set.
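The 80/20 hold-out split with 10-fold cross-validation on the training portion can be sketched in plain Python as follows. This is only an illustration of the index bookkeeping, not the Spark code used in the experiments; the helper names `holdout_split` and `k_folds`, the fixed seed, and the record count of 1,000 are assumptions.

```python
import random

def holdout_split(n, test_fraction=0.2, seed=0):
    """Shuffle record indices 0..n-1 and split them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_folds(indices, k=10):
    """Yield (train_idx, val_idx) pairs; each fold serves once as validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val

# Hypothetical dataset of 1,000 records.
train_idx, test_idx = holdout_split(1000)
splits = list(k_folds(train_idx, k=10))
```

The held-out 20% never appears in any fold, so cross-validation can be used for model selection while the final score is still measured on unseen records.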

3.3.1 Stage-1:The Anomaly-Based Detection Module

Apache Spark is a capable big data processing engine for detecting cybersecurity attacks. Spark MLlib is among the most efficient big data analytics libraries currently available, implementing over 55 ML algorithms [53,54]. Spark MLlib is well suited to ML tasks and is 10 times faster than Hadoop-based big data processing tools for iterative tasks. The development of Spark MLlib was initiated in 2012 as part of an ML-based project, and in 2013 it became an open-source library for ML tasks. Spark MLlib contains ML algorithms for classification, clustering, regression, and dimensionality reduction that are crucial to developing classic real-time ML applications; its mechanisms have been used by scholars worldwide to advance high-dimensional data analytics.

MLlib-based anomaly detection at Stage-1 was first modeled on an established training set that contains both normal and malicious traffic. Test data containing unknown, regular, and malicious traffic were used to validate the anomaly module of the IDS. The original traffic data were divided into two classes: abnormal (malicious) and normal. Abnormal network traffic behavior is known as anomaly traffic. Traffic detected as abnormal was passed to Stage-2, the LSTMAE, where the misuse detection technique performed further attack detection and classification.

3.3.2 Stage-2:A Misuse Detection Module

The LSTMAE was used in this stage to identify misuse in network traffic, with the goal of further classifying the anomalous traffic according to specific policies. An overview of the misuse detection module using an LSTMAE is given in Fig. 2. LSTM is an upgraded version of the RNN that was introduced in [55,56] to address the vanishing and exploding gradient issues efficiently. The hidden units of the RNN are replaced with memory blocks, each comprising a memory cell intended to retain information and three gates that play active roles in the LSTM (input, output, and forget gates) [57]. The most powerful feature of LSTM lies in its capability to capture long-term dependencies and learn effectively from variable-length sequences.

Figure 2: The micro-overview of LSTMAE

Research has shown that LSTM is reliable and effective for video classification [58], sentiment analysis [59], emotion recognition [60], and abnormal activity detection [61].
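For reference, the three gates and the memory cell mentioned above are commonly written as follows. This is the widely used textbook formulation; the paper does not state which LSTM variant was implemented, so the symbols (input $x_t$, hidden state $h_t$, cell state $c_t$, weight matrices $W$ and $U$, biases $b$) are assumptions introduced here for illustration.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication. The additive update of $c_t$ is what lets gradients flow across many time steps, which is why the LSTM mitigates the vanishing gradient problem.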

In this module, the LSTMAE is used as the misuse attack detection technique. It aims to further categorize the abnormal data from Stage-1 into the corresponding classes: DoS, Scan, HTTP, and R2L. For misuse ID, the LSTMAE was first trained on the abnormal traffic to create a model that provides a baseline profile of abnormal traffic. A test set is then fed to the trained model to determine whether each instance is malicious (abnormal) or normal. An alarm goes off when a match is found. The LSTMAE can extract more internal information than hand-crafted techniques.
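The control flow of the two stages can be summarized in a short Python sketch. The trained Spark MLlib and LSTMAE models are replaced here by hypothetical stand-in functions (`toy_stage1`, `toy_stage2`) with made-up thresholds and feature names; only the routing logic, in which Stage-2 sees just the records that Stage-1 flags as anomalous, reflects the description above.

```python
# Attack classes named in the text for the Stage-2 misuse module.
ATTACK_CLASSES = ("DoS", "Scan", "HTTP", "R2L")

def hybrid_detect(record, stage1_is_anomalous, stage2_attack_class):
    """Route a record through Stage-1 (anomaly) and, if flagged, Stage-2 (misuse)."""
    if not stage1_is_anomalous(record):
        return "normal"                      # Stage-1 verdict: benign traffic
    label = stage2_attack_class(record)      # Stage-2 refines the anomaly
    if label not in ATTACK_CLASSES:
        raise ValueError(f"unexpected attack class: {label}")
    return label

# Hypothetical stand-ins for the trained models (illustration only).
def toy_stage1(record):
    return record["packets_per_s"] > 1000

def toy_stage2(record):
    return "DoS" if record["dst_port"] == 80 else "Scan"

benign = hybrid_detect({"packets_per_s": 10, "dst_port": 80}, toy_stage1, toy_stage2)
flood = hybrid_detect({"packets_per_s": 5000, "dst_port": 80}, toy_stage1, toy_stage2)
probe = hybrid_detect({"packets_per_s": 5000, "dst_port": 22}, toy_stage1, toy_stage2)
```

Keeping the two models behind plain callables means either stage can be swapped (for example, LR for XGB in Stage-1) without touching the routing code.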

4 Experimental Evaluations

This section discusses the experimental results in detail. The advantage of the proposed HIIDS can only be established through experiments, so we evaluated it on the ISCX 2012 ID dataset in terms of normal and attack classification, false positives, false negatives, true positives, attack detection accuracy, and error rate.

4.1 Performance Metrics

The elements of the confusion matrix, which represent the expected and predicted classifications, are given in Tab. 5. In a two-class problem, each prediction is either correct or incorrect. Four essential states must therefore be computed in the confusion matrix.

• True positive (TP). Denoted by x. The model correctly predicts positive; the instance is accurately classified as normal.

• False negative (FN). Denoted by y. A wrong prediction in which instances that are in fact malicious are identified as normal, and the model inaccurately predicts negative.

• False positive (FP). Denoted by z. The model mistakenly predicts positive; instances detected as attacks are in fact normal.

• True negative (TN). Denoted by t. Instances that are correctly identified as attacks, for which the model predicts negative.

Table 5: Confusion matrix for proposed IDS

From the above-mentioned conditions of the confusion matrix, we can compute the performance of the system as follows. The two most essential and general parameters for the evaluation of an ID system are the TPR, or detection rate (DR), and the false alarm rate (FAR). The percentage of intrusion instances recognized by the ID model is known as the DR, while the proportion of misclassified normal instances is known as the FAR.
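For reference, these quantities can be computed from the four confusion-matrix counts as in the sketch below. It uses the standard textbook definitions and assumes the usual IDS convention in which the attack class is the positive class, so that DR coincides with TPR. The function name `ids_metrics` and the counts in the example are hypothetical and are not results from this study.

```python
def ids_metrics(tp, fn, fp, tn):
    """Standard metrics from confusion-matrix counts (x=TP, y=FN, z=FP, t=TN).

    Assumes every denominator is non-zero, i.e. the data contain both
    attack and normal instances and at least one positive prediction.
    """
    total = tp + fn + fp + tn
    dr = tp / (tp + fn)              # detection rate (TPR, recall)
    far = fp / (fp + tn)             # false alarm rate (FPR)
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    f1 = 2 * precision * dr / (precision + dr)
    return {
        "dr": dr,
        "far": far,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts for illustration only
m = ids_metrics(tp=90, fn=10, fp=5, tn=95)
```

A detector improves on a baseline in the sense used here when its `dr` is higher and its `far` is lower on the same test data.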

We claim that the HIIDS is superior to conventional IDS, as it increases DR and decreases FAR.

4.2 Evaluation of the Hybrid IDS

Tab. 6 presents the overall performance of several classifiers. The results of the random search are described in this section. As presented in the table, the classical LR model achieved an F1-score of ~83%, whereas the tree-based ML classifiers increased the score considerably, to 88%.

However, the most significant improvement that we observed was with state-of-the-art DL approaches such as LSTMAE, which correctly identified misuse in up to 97.0% of cases. This improvement was due to the temporal feature extraction performed by the LSTM and the extraction of more important internal information by the AE.

Table 6: Classifier performance at several stages

4.3 Overall Analysis

Tab. 7 summarizes the results of the current approach on the ISCX-2012 data. This dataset was produced later than the KDD and DARPA data, so only a few corresponding results exist. We therefore report, from the existing simulation results, the best outcome for each stage in terms of FAR and accuracy. It is evident that the proposed ID system performs well in terms of both accuracy and FAR relative to state-of-the-art techniques. This is owing to the combined Spark MLlib and LSTMAE approach. Note that the comparisons are for reference only, as researchers have used data of diverse volumes and distributions, as well as different sampling techniques and preprocessing methods. A simple comparison of metrics such as testing and training time is therefore generally not suitable. Although the proposed ID system attained improved performance on the considered evaluation metrics, it cannot be claimed that the proposed approach fully outperforms other methods. Nevertheless, the HIIDS approach, which is robust, fast, simple, and highly applicable to real-time scenarios, makes it possible to attain a high level of network security.

Table 7: Comparison of existing approaches to ISCX-2012 data

5 Conclusion and Future Work

In this article, the HIIDS was developed using the Spark MLlib and LSTMAE deep learning approach, which is an efficient cybersecurity method. We trained the HIIDS on the ISCX-2012 dataset. We implemented the HIIDS using several robust classification algorithms, such as LR and XGB, for anomaly detection at Stage-1 and the LSTMAE deep learning technique for misuse detection at Stage-2. The proposed HIIDS, based on DL classification, combines the benefits of both signature-based (SB) and anomaly-based (AB) approaches, reducing computational complexity and increasing ID accuracy and DR.

Both conventional ML and LSTMAE deep learning models were evaluated using well-known classification metrics, such as F1-score, precision, recall, DR, and classification accuracy.

We believe that our approach can be extended to other domains in the future, for example to recognizing misuse and anomalies in real-time image data. Future work will emphasize exploring deep learning as a feature extraction mechanism that learns informative data representations for other anomaly detection problems in modern real-time datasets.

Funding Statement: This research was supported by the MSIT (Ministry of Science, ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2016-0-00465) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.