An Automated Penetration Semantic Knowledge Mining Algorithm Based on Bayesian Inference

Computers, Materials & Continua, 2021, Issue 3

Yichao Zang, Tairan Hu, Tianyang Zhou and Wanjiang Deng

1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450000, China

2 National Engineering Technology Research Center of the Digital Switching System, Zhengzhou 450000, China

3 NUS Business School, National University of Singapore, Singapore 119077, Singapore

Abstract: Mining the penetration testing semantic knowledge hidden in vast amounts of raw penetration testing data is of vital importance for automated penetration testing. Associative rule mining, a data mining technique, has been studied and explored for a long time. However, few studies have focused on knowledge discovery in the penetration testing area. Our experiments reveal that the long-tail distribution of penetration testing data nullifies the effectiveness of associative rule mining algorithms based on frequent patterns. To address this problem, a Bayesian inference based penetration semantic knowledge mining algorithm is proposed. First, a directed bipartite graph model, a kind of Bayesian network, is constructed to formalize penetration testing data. Then, we adopt the maximum likelihood estimation method to optimize the model parameters and decompose the large Bayesian network into smaller networks based on the conditional independence of variables for improved solution efficiency. Finally, irrelevant-variable elimination is adopted to extract penetration semantic knowledge from the conditional probability distribution of the model. The experimental results show that the proposed method can discover penetration semantic knowledge from raw penetration testing data effectively and efficiently.

Keywords: Penetration semantic knowledge; automated penetration testing; Bayesian inference; cyber security

1 Introduction

With the increasing size of computer networks and complexity of information systems, the security problems faced by corporations have become much more prominent than ever [1]. Penetration testing, a common security check approach, is widely used to discover attack paths in computer networks and thereby improve the security level of systems. Penetration testing aims to obtain control over specific computers in a target network. Usually, it starts from a controlled computer, from which hackers gather computer and network information through scanners and vulnerability exploitation. After gathering information, hackers choose the best vulnerability exploitation program to penetrate the target host. Even though penetration testing can discover attack paths hidden in a network, its quality depends heavily on the security expert's experience, and automating penetration testing has become a popular topic in both academia and engineering [2]. Mining the semantic knowledge hidden in raw penetration testing data is an essential part of the solution for automated penetration testing. Penetration semantic knowledge is a mapping relation {software: version → vulnerability}, which means that the specific version of the software may cause the vulnerability; based on such a rule, we can choose the corresponding vulnerability exploitation program to control a host when faced with that specific version of the software. Existing studies have extracted penetration semantic knowledge by transforming specific vulnerability databases such as Metasploit and Nessus. There are two disadvantages to this kind of approach. One is that the semantic knowledge extracted from a specific vulnerability database does not match the information gathered through scanners. For example, the operating system information gathered by the nmap scanner for the vulnerability numbered cve-2019-0708 (common vulnerability and exposure, CVE) is windows_7_sp1, but the operating system description of cve-2019-0708 in Metasploit is windows_7 sp1. Even though they look similar (windows_7_sp1 and windows_7 sp1), they do not match each other, so the knowledge rule {windows_7 sp1 → cve-2019-0708} is missed during penetration testing. The other is that penetration semantic knowledge becomes incomplete when the information gathered through multiple scanners is not considered. During penetration testing, multiple tools are used to gather information and form intact knowledge, meaning that penetration semantic knowledge is composed of multiple sources, and transforming a single vulnerability database cannot meet this demand. Tab. 1 shows an example of host configuration and vulnerability information gathered by the nmap and shodan scanners [3]. It indicates that the software Apache:2.4, OpenSSH:7.4 and CCMPlayer:1.5 are installed on the host, the operating system is either windows xp sp3 or windows 7, and there is a vulnerability numbered cve-2011-5170 on the host. The penetration knowledge hidden in the raw data is {CCMPlayer:1.5, windows xp sp3 → cve-2011-5170}, which means that when faced with the software CCMPlayer:1.5 and the operating system windows xp sp3, the vulnerability exploitation program corresponding to cve-2011-5170 can be used to control the target host. In summary, the aim of penetration semantic knowledge mining is to discover all such reliable mapping relations in the raw penetration testing data gathered by multiple scanners.
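To make the mapping relation concrete, a mined rule such as {CCMPlayer:1.5, windows xp sp3 → cve-2011-5170} can be represented as a small data structure. The following is a minimal illustrative sketch, not the paper's implementation; the class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PenetrationRule:
    """Illustrative container for one penetration semantic rule."""
    conditions: frozenset  # host-configuration facts that must all be present
    vulnerability: str     # CVE identifier the conditions map to

    def matches(self, host_facts: set) -> bool:
        # The rule fires when every condition is observed on the host.
        return self.conditions <= host_facts

rule = PenetrationRule(
    conditions=frozenset({"CCMPlayer:1.5", "windows xp sp3"}),
    vulnerability="cve-2011-5170",
)
# Facts gathered for one host by nmap/shodan, as in Tab. 1.
host = {"Apache:2.4", "OpenSSH:7.4", "CCMPlayer:1.5", "windows xp sp3"}
print(rule.matches(host))  # True: the exploit for cve-2011-5170 applies
```

When the rule matches, an automated penetration testing system would select the exploitation program associated with the rule's vulnerability.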

Table 1: Example of scanned configuration and vulnerability information for a single host through nmap and shodan scanners

Considering the problems with current penetration semantic knowledge mining methods, this paper applies Bayesian inference techniques to mine semantic knowledge from raw penetration testing data. The remainder of the paper is organized as follows. Section 2 presents related work involving associative rule mining algorithms that could be used to mine penetration semantic knowledge. Section 3 introduces some definitions and gives the problem statement. Section 4 presents the proposed Bayesian inference based penetration semantic knowledge mining algorithm. In Section 5, the details of the penetration testing data are analysed, and the performance of associative rule mining algorithms is compared with that of the proposed method on the same data. Section 6 summarizes the study and notes some future research objectives.

2 Related Works

Associative rule mining [4], first proposed by Agrawal et al., aims to discover itemsets that appear frequently, a goal similar to penetration semantic knowledge mining. The Apriori algorithm [5] was the first algorithm used to mine associative rules by generating candidate frequent itemsets. However, this method has high computational complexity because of the large number of database scan operations required. Han et al. proposed the FP-Growth algorithm [6] to address this problem; in FP-Growth, an FP-tree structure is created to avoid repeated database scans. However, the FP-Growth algorithm is limited to small-scale databases because of its extremely high memory consumption. Qin et al. [7] proposed a compact FP-tree for mining associative rules that avoids conditional FP-tree generation to decrease memory consumption. Zaki proposed the Eclat algorithm [8], which mines associative rules by applying set union operations on a vertical database structure. Pei proposed another memory-based associative rule mining algorithm, H-mine [9], based on an H-struct structure. H-mine achieves frequent pattern mining by partitioning data so that it can handle large databases. Deng proposed a data structure called N-list for a cache database, and the PrePost algorithm [10] was proposed based on N-list. Deng also adopted children-parent equivalence to prune search branches in PrePost, which effectively reduced the number of candidate frequent itemsets [11]. Because the N-list structure consumes considerable memory to store both pre-traversal and post-traversal information, Deng proposed the NodeSets structure [12], which stores only pre-traversal information for mining associative rules; this approach reduced memory consumption by 50%. Aryabarzan et al. proposed another prefix tree structure, NegNodeSet [13], and adopted bitmap technology to mine associative rules. In addition, some researchers have adopted high utility itemset mining techniques to achieve associative rule mining. The two-phase algorithm [14] is a famous and classical candidate-based high utility itemset mining algorithm composed of two phases: the first phase prunes the search space and generates candidates using the proposed transaction weight downward closure property, and the second phase scans the database to filter high utility itemsets from the high transaction weight utility itemsets identified in phase I. Vincent et al. proposed an algorithm named UP-Growth [15] to mine high utility itemsets; this algorithm constructs a utility pattern tree from databases based on the DGU (discarding global unpromising items), DGN (discarding global node utility), DLU (discarding local unpromising items) and DLN (discarding local node utility) strategies to prune the search space. Even though many tricks have been proposed to prune search spaces, there are still many candidate itemsets to be tested in phase II, which consumes a large amount of memory and time. To overcome these problems, Liu et al. proposed an algorithm called HUI-Miner [16] to mine high utility itemsets without generating candidates. HUI-Miner uses a novel structure called the utility-list to store both the utility information and heuristic information of an itemset for pruning the search space. Based on the utility-list, high utility itemsets can be mined by joining utility lists instead of scanning the database, which decreases the mining time. Further, Krishnamoorthy et al. [17] proposed an algorithm called HUP-Miner that employs two novel pruning strategies. Peng et al. utilized an IHUP tree structure [18] to guide the itemset expansion process and avoid considering itemsets that do not exist in the database. Although associative rule mining algorithms can be used to mine penetration semantic knowledge from penetration testing data, the effect is not ideal. Notably, some host configuration and vulnerability information does not meet the requirements for frequent patterns.

3 Definitions and Problem Statements

3.1 Definitions

Definition 1. Given itemset I = {i1, i2, ···, in} and database DB = {T1, T2, ···, Tm}, where Ti (1 ≤ i ≤ m) is a transaction that satisfies Ti ⊆ I, for any itemset A ⊆ I, A is contained by transaction Ti when A ⊆ Ti holds. The support of A is the number of transactions containing A, denoted as sup(A). A is called a frequent itemset if sup(A) ≥ ξ for any specified minimal support ξ.
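Definition 1 can be sketched directly in code. The following minimal example (with made-up transactions) counts the transactions containing an itemset:

```python
def support(itemset, db):
    """sup(A): the number of transactions in db that contain itemset A."""
    a = set(itemset)
    return sum(1 for t in db if a <= set(t))

# Toy database of three transactions (illustrative service names).
db = [
    {"Apache:2.4", "OpenSSH:7.4"},
    {"Apache:2.4", "MySQL:5.7"},
    {"Apache:2.4", "OpenSSH:7.4", "MySQL:5.7"},
]
print(support({"Apache:2.4"}, db))                 # 3
print(support({"Apache:2.4", "OpenSSH:7.4"}, db))  # 2
# With minimal support ξ = 2, both itemsets above are frequent.
```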

Definition 2. A Bayesian network is a kind of probabilistic graphical model, denoted as G = {V, E}, where V is a set of random variables and E is a set of dependence relationships among variables, denoted as p(v|π(v)); π(v) is the set of parent nodes of v. A Bayesian network is used to represent cause-effect relationships, denoted as <π(v), v> ∈ E; the specific value a of π(v) is the cause, and the value b of v is the effect. Thus, <π(v) = a, v = b> represents a strong causal effect if p(<π(v) = a, v = b>) ≥ α holds.

Definition 3. d-separation can be defined as follows. Given a Bayesian network G = {V, E} and pairwise disjoint sets X, Y, Z ⊆ V, Z d-separates X and Y if every path between X and Y satisfies one of the following three conditions: (1) There is a node a on the path that belongs to Z, and the edges to its two neighbours b and c on the path follow the same direction, <b, a><a, c> or <a, b><c, a> (a chain). (2) There is a node a on the path that belongs to Z, and the edges to its two neighbours b and c point outward from a, <a, b><a, c> (a fork). (3) There is a node a on the path such that neither a nor any of its descendants belongs to Z, and the edges from its two neighbours b and c point into a, <b, a><c, a> (a collider).

Definition 4. A directed bipartite graph G = {V, E} is a special kind of Bayesian network for which V can be divided into two disjoint sets A and B such that for each edge <vm, vn> ∈ E, vm belongs to A and vn belongs to B.
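Definition 4 amounts to a simple structural check. The sketch below (toy node names, not from the paper's data) verifies that every edge runs from the conditional set A to the deductive set B:

```python
def is_directed_bipartite(vertices_a, vertices_b, edges):
    """Check Definition 4: A and B are disjoint and every directed edge
    goes from a node in A to a node in B."""
    if vertices_a & vertices_b:
        return False
    return all(u in vertices_a and v in vertices_b for (u, v) in edges)

A = {"windows_10", "IE_10"}   # conditional (host-configuration) nodes
B = {"cve-2017-0472"}         # deductive (vulnerability) nodes
E = [("windows_10", "cve-2017-0472"), ("IE_10", "cve-2017-0472")]
print(is_directed_bipartite(A, B, E))  # True
```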

3.2 Problem Statement

Penetration semantic knowledge mining aims to discover associative rules, including rules about open ports, services, operating systems, etc., from vast amounts of penetration testing data. There are two problems in this setting: one is that there is far more irrelevant penetration testing data than critical knowledge data, and the other is that many mined rules are invalid and cannot be used to improve the efficiency of automated penetration testing. Therefore, penetration semantic knowledge mining aims to discover rare penetration knowledge while avoiding redundant and low-value penetration testing data.

4 Penetration Semantic Knowledge Mining

4.1 Model

Considering the above problems, we adopt a Bayesian network to formalize penetration testing data and the maximum likelihood estimation method to retrieve its parameters. Finally, we adopt a variable elimination method to retrieve the probability of each mined rule and discover all penetration testing knowledge of interest. The steps are described as follows.

Step 1: Given host configuration information A = {Ai} and vulnerability information B = {Bi}, where i = 1, ···, k is the host identifier, the Bayesian network is composed of the union set A ∪ B, where A is the set of conditional nodes carrying host configuration information (operating system, applications, etc.) and B is the set of deductive nodes of host vulnerabilities. The Bayesian network is constructed by adding all directed edges <vm, vn>, where vm ∈ Ai and vn ∈ Bi. Because there are only edges from conditional nodes to deductive nodes, the constructed Bayesian network is actually a directed bipartite graph, as shown in the left panel of Fig. 1.
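Step 1 can be sketched as follows, assuming each host record is a pair (Ai, Bi) of configuration facts and vulnerabilities; the host data below is illustrative:

```python
def build_bipartite_graph(hosts):
    """Step 1 sketch: hosts is a list of (config_set, vuln_set) pairs (Ai, Bi).
    Returns the conditional nodes, deductive nodes and directed edges A -> B."""
    cond, deduct, edges = set(), set(), set()
    for configs, vulns in hosts:
        cond |= configs
        deduct |= vulns
        # Add an edge from every configuration fact to every vulnerability
        # observed on the same host.
        for c in configs:
            for v in vulns:
                edges.add((c, v))
    return cond, deduct, edges

hosts = [
    ({"windows xp sp3", "CCMPlayer:1.5"}, {"cve-2011-5170"}),
    ({"windows 10", "IE 10"}, {"cve-2017-0472"}),
]
cond, deduct, edges = build_bipartite_graph(hosts)
print(("CCMPlayer:1.5", "cve-2011-5170") in edges)  # True
```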

Figure 1: Bayesian inference based penetration knowledge discovery. The left panel shows a Bayesian network model built from the original penetration data, and the right panel shows penetration knowledge inferred from the constructed Bayesian network model

Step 2: The parameters of the directed bipartite graph can be described with a conditional probability distribution p(v|π(v)), where π(v), a set of conditional nodes, is the parent set of deductive node v. If π(v) is empty, the conditional probability distribution reduces to the initial probability distribution. Finally, the parameters of the directed bipartite graph can be described as a set θ, where each element in θ is represented as

θab = p(v = b | π(v) = a),

which is the probability of vulnerability b conditioned on a combination of host configurations a. Concretely, this value could be the probability of vulnerability cve-2017-0472 occurring when windows 10 and the IE 10 browser are installed on a host. Given the constructed Bayesian network, the parameter θab is obtained by optimizing the maximum likelihood function

L(θ|D) = ∏(l=1..n) p(Dl|θ),

where p(Dl|θ) is the probability of record Dl based on θ.

Step 3: Given the maximum likelihood function L(θ|D), we can optimize the objective function to retrieve the parameter θab, based on which we can obtain the probability of each expression <π′(v) = a, v = b> by eliminating irrelevant variables, where π′(v) ⊆ π(v). If the probability of <π′(v) = a, v = b> is larger than the user-specified threshold α, a penetration testing rule of interest is established; otherwise, it is not. Specifically, as shown in the right panel of Fig. 1, the host is affected by vulnerability b with high probability when the combined host configuration is a. The algorithm for constructing the directed bipartite graph is shown in Alg. 1.
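The thresholding at the end of Step 3 can be sketched as a filter over the estimated conditional probabilities. The dictionary layout and names below are illustrative assumptions, not the paper's data structures:

```python
def extract_rules(cpd, alpha=0.5):
    """Step 3 sketch: an expression <parent configuration, vulnerability
    present> becomes a penetration rule when its conditional probability
    reaches the user-specified threshold alpha. `cpd` maps
    (parent_config, vuln_value) -> probability."""
    return [cfg for (cfg, vuln), p in cpd.items() if vuln == 1 and p >= alpha]

cpd = {
    (("windows 10", "IE 10"), 1): 0.9,   # strong rule, kept
    (("windows 10", "IE 10"), 0): 0.1,
    (("windows 7",), 1): 0.2,            # below threshold, discarded
    (("windows 7",), 0): 0.8,
}
print(extract_rules(cpd, alpha=0.5))  # [('windows 10', 'IE 10')]
```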

Algorithm 1.Directed bipartite graph construction algorithm

4.2 Penetration Semantic Knowledge Mining Algorithm

In this section, we introduce the use of penetration testing data to calculate the parameters of the Bayesian network and improve the efficiency of the parameter solution through d-separation. First, we introduce the following theorem.

Theorem 1: Given a directed bipartite graph G(V, E), where V = A ∪ B, A is the set of conditional nodes, and B is the set of deductive nodes, P(a, b|M) = P(a|M)P(b|M) holds for every two nodes a, b ∈ B and every set of conditional nodes M ⊆ A.

Proof. First, we prove that P(a, b|m) = P(a|m)P(b|m) holds for each conditional node m ∈ M. Because the Bayesian network G is a directed bipartite graph, there are three kinds of relations between m and a, b: (1) both <m, a> and <m, b> belong to E; (2) one and only one of <m, a> and <m, b> belongs to E; (3) neither <m, a> nor <m, b> belongs to E. For (1), m is a common parent of a and b, so P(a, b|m) = P(a|m)P(b|m) holds. For (2), assuming that <m, a> ∈ E, there is no connection between b and either a or m; therefore, b is independent of a and m, so P(a, b|m) = P(a|m)P(b) = P(a|m)P(b|m) holds. For (3), there are no connections among a, b and m, so these nodes are independent of each other, and we can conclude that P(a, b|m) = P(a, b) = P(a)P(b) = P(a|m)P(b|m). Since m is arbitrary, P(a, b|M) = P(a|M)P(b|M) holds.
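As an illustrative sanity check of case (1), the factorization P(a, b|m) = P(a|m)P(b|m) can be verified numerically on a toy network where m is a common parent of a and b. All probabilities below are made up for the check:

```python
import itertools

# Toy CPDs for case (1) of the proof: m -> a and m -> b, so a and b are
# conditionally independent given m. Indexed as p_a_given_m[m][a], etc.
p_m = {0: 0.4, 1: 0.6}
p_a_given_m = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
p_b_given_m = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, m):
    # Joint distribution implied by the bipartite structure.
    return p_m[m] * p_a_given_m[m][a] * p_b_given_m[m][b]

for a, b, m in itertools.product((0, 1), repeat=3):
    p_ab_given_m = joint(a, b, m) / p_m[m]           # P(a, b | m)
    product = p_a_given_m[m][a] * p_b_given_m[m][b]  # P(a|m) P(b|m)
    assert abs(p_ab_given_m - product) < 1e-12
print("Theorem 1 factorization holds on the toy network")
```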

Theorem 1 indicates that the parameter solution process for a large Bayesian network can be divided into many smaller processes to improve efficiency. Assuming that the Bayesian network constructed from the penetration testing data D = {Dl | 1 ≤ l ≤ n} is denoted as G(V, E), the maximum likelihood function L(θ|D) can be expressed as follows:

L(θ|D) = ∏(i,j,k) θijk^ϕ(i,j,k),

where ϕ(i, j, k) is the number of records that satisfy π(vi) = j and vi = k. We can optimize the objective function and obtain the parameter as follows:

θijk = ϕ(i, j, k) / Σ(k′) ϕ(i, j, k′),

representing the ratio of records for which the value of vi is k and the value of the corresponding parent set π(vi) is j, with all possible values of vi considered in the denominator. Assuming that the obtained parameter set is denoted as θ = {θ1::, θ2::, ···, θn::}, where θi:: = p(vi|π(vi)) is the conditional probability distribution of node vi, we then implement a variable elimination process [19] to retrieve the key factors. Let the sequence of variables to eliminate be denoted as ρ, ρ ⊆ π(vi)\π′(vi). For each variable z ∈ ρ, the probability distribution of node vi can be represented as θi:: = p(vi|π(vi)) = {p1, p2, ···, pm} according to the definition of θi::, where m is the number of free variables. Let g be the product of the entries pi that involve variable z, d be the number of free variables containing z in the probability distribution set of vi, and h = Σz g be the probability distribution obtained after summing out variable z; h is then added back to θi::. Then, ρ is iterated over until it is empty to obtain all penetration semantic rules with probabilities larger than the user-specified threshold α. The final penetration semantic knowledge mining algorithm is shown as follows.
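The maximum likelihood estimate θijk = ϕ(i, j, k) / Σk′ ϕ(i, j, k′) reduces to counting records per parent configuration. The following is a minimal sketch with made-up records; the variable names are illustrative:

```python
from collections import Counter

def estimate_cpd(records, v, parents):
    """Maximum likelihood estimate of p(v | parents) from penetration
    testing records (dicts of variable -> value), i.e.
    theta_ijk = phi(i, j, k) / sum_k' phi(i, j, k')."""
    counts = Counter()
    for r in records:
        j = tuple(r[p] for p in parents)  # parent configuration
        counts[(j, r[v])] += 1            # phi(i, j, k)
    totals = Counter()
    for (j, k), c in counts.items():
        totals[j] += c                    # sum over k' of phi(i, j, k')
    return {(j, k): c / totals[j] for (j, k), c in counts.items()}

records = [
    {"os": "win10", "ie": 1, "cve-2017-0472": 1},
    {"os": "win10", "ie": 1, "cve-2017-0472": 1},
    {"os": "win10", "ie": 1, "cve-2017-0472": 0},
    {"os": "win7",  "ie": 0, "cve-2017-0472": 0},
]
cpd = estimate_cpd(records, "cve-2017-0472", ("os", "ie"))
print(cpd[(("win10", 1), 1)])  # 2 of 3 win10+IE records have the CVE
```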

Algorithm 2.Penetration semantic knowledge mining algorithm

5 Experiments

In this section, we compare our proposed algorithm with other associative rule mining algorithms on four real penetration testing datasets.

5.1 Metric

The experimental metric adopted is the ROC curve, which is used to evaluate the performance of the algorithms. This curve is based on two quantities, the true positive rate (TPR) and the false positive rate (FPR), whose formulas are as follows:

TPR = |D ∩ P| / |P|, FPR = |D ∩ N| / |N|,

where P is the set of all true knowledge rules, N is the set of all false knowledge rules and D is the set of all mined knowledge rules.
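The two metrics can be sketched directly from their set definitions; the rule identifiers below are placeholders, not results from the paper:

```python
def tpr_fpr(mined, true_rules, false_rules):
    """TPR = |D ∩ P| / |P| and FPR = |D ∩ N| / |N| from Section 5.1."""
    tpr = len(mined & true_rules) / len(true_rules)
    fpr = len(mined & false_rules) / len(false_rules)
    return tpr, fpr

P = {"r1", "r2", "r3", "r4"}  # true knowledge rules
N = {"r5", "r6"}              # false knowledge rules
D = {"r1", "r2", "r5"}        # mined knowledge rules
print(tpr_fpr(D, P, N))  # (0.5, 0.5)
```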

5.2 Dataset

The experimental datasets encompass four services, namely, Apache, IIS, MySQL and nginx, and contain host configuration information and vulnerability information. The process used to collect this information is as follows. First, we collect the IPs of hosts running these services through the zoomeye scanner [20]. Then, we collect host configuration information for each IP with the nmap scanner. Last, we use the shodan scanner to collect vulnerability information for each IP and merge this information with the host information collected above to form an intact penetration testing record for each IP. To describe the distribution characteristics, we plot the information in Fig. 2, from which we can see that the distribution of the number of records per vulnerability has a long tail: many vulnerabilities appear in only a few records. Most vulnerabilities have fewer than 10 records, and the records for these vulnerabilities account for approximately 17.5% of the records for all vulnerabilities. The comparative algorithms are implemented based on SPMF [21] and include the Apriori, FP-Growth, LCMFreq and PrePost+ algorithms. The support parameter ranges from 0.2 to 0.8, and the experiments are run on a Core i7-8750H host with 64 GB of memory.

Figure 2: Distribution of the number of host vulnerabilities with respect to the number of records

5.3 Results and Analysis

This section compares the proposed algorithm with traditional associative rule mining algorithms in terms of accuracy and efficiency. The accuracy of the algorithms versus the support on the Apache dataset is shown in Tab. 2, from which we can see that the accuracy of most associative rule mining algorithms is low, at less than 2%. Notably, the data have a long-tailed distribution, and most of the cases have low support, causing the associative rule mining algorithms to fail. In contrast, the proposed algorithm displays better performance than the other algorithms, with the highest accuracy reaching 98.22%. The accuracy decreases with increasing support because the higher the support level is, the fewer knowledge rules reach that support.

Table 2: Comparison of the accuracy ratio for penetration testing knowledge discovery based on the Apache dataset with different support levels

As accuracy cannot reflect all aspects of performance, we adopt receiver operating characteristic (ROC) curve analysis to describe the performance of the algorithms, as shown in Fig. 3. The performance of the algorithms is assessed on the four experimental datasets, where (a), (c), (e) and (g) display the performance for support values ranging from 0.2 to 0.8, and (b), (d), (f) and (h) are locally enlarged regions of (a), (c), (e) and (g), respectively. Because of a memory corruption problem, there is no line representing FP-Growth on the IIS and nginx datasets. Furthermore, from (a), (c), (e) and (g), we can see that the ROC curve is sensitive to the support parameter. Although the ROC curve lies close to the diagonal line, this does not mean that the proposed algorithm behaves randomly, because mining penetration semantic knowledge rules from large quantities of penetration testing data is not a simple classification task. Moreover, the ROC curves of the comparative associative rule mining algorithms show a "cluster" phenomenon, and their accuracy is low regardless of the value of the support parameter. This phenomenon illustrates that associative rule mining algorithms are not suitable for penetration semantic knowledge mining problems, and their performance is far below that of the proposed method. Additionally, the locally enlarged plots show that the Apriori algorithm performs better than the other associative rule mining algorithms, but its accuracy is still less than 2%, which is far from meeting the requirements of penetration semantic knowledge mining. To obtain the best support for the proposed algorithm, we adopt the Youden index [22] to evaluate the performance of the proposed algorithm at different support levels. The Youden index formula is as follows:

J = TPR − FPR.

Figure 3: Receiver operating characteristic curves and locally enlarged receiver operating characteristic curves of different algorithms for each dataset.(a) Apache.(b) Apache detail.(c) IIS.(d) IIS detail.(e)MySQL.(f) MySQL detail.(g)Nginx.(h)Nginx detail

The Youden index values for the proposed algorithm on the Apache dataset are shown in Tab. 3, from which we can see that the critical (maximal) value is 0.0559, meaning that the best support parameter is 50%. When the best support parameter is used, the TPR improves, and the FPR decreases.
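Selecting the best support via the Youden index J = TPR − FPR can be sketched as follows. The (TPR, FPR) pairs below are made-up values for illustration, not the measurements in Tab. 3:

```python
def youden_index(tpr, fpr):
    """Youden index J = TPR - FPR; a higher J means a better operating point."""
    return tpr - fpr

# Illustrative (TPR, FPR) pairs per support level; pick the support whose
# pair maximizes J, as done with Tab. 3 in the paper.
candidates = {0.2: (0.60, 0.35), 0.5: (0.72, 0.20), 0.8: (0.40, 0.10)}
best = max(candidates, key=lambda s: youden_index(*candidates[s]))
print(best)  # 0.5
```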

Furthermore, we compared the memory and CPU consumption of the algorithms on the Apache dataset. The results are shown in Fig. 4, where each bottom point marks a support level. Fig. 4 shows that the associative rule mining algorithms have similar memory consumption, with the PrePost+ algorithm consuming the most memory (approximately 1080 MB); the proposed algorithm has a comparable memory consumption of approximately 1250 MB. Regarding CPU consumption, the proposed algorithm consumes less CPU time than the compared associative rule mining algorithms, with a reduction of approximately 1.8%. The cause of this difference is that the proposed algorithm transforms the parameter solution problem into a counting problem, so the computational complexity is considerably reduced, thus decreasing CPU consumption.

Table 3: Comparison of Youden index values for the proposed algorithm based on the Apache dataset

Figure 4: Comparison of memory and CPU consumption for each data mining algorithm based on the Apache dataset.(a) Apriori.(b)FP-Growth.(c) LCMFreq.(d) PrePost+.(e) Bayesian inference

6 Conclusion

Considering the problem of mining penetration semantic knowledge to automate penetration testing, this paper proposed a Bayesian inference based penetration semantic knowledge mining algorithm. First, a Bayesian network model is constructed from penetration testing data, and the parameter solution process divides the whole network into smaller subnetworks according to an independence analysis, after which the variable elimination method is adopted to retrieve penetration semantic knowledge. The experimental results show that the proposed method is superior to other associative rule mining algorithms. Moreover, some tasks could still contribute to this work, such as adopting domain knowledge in penetration semantic knowledge mining to improve accuracy and decreasing the computational complexity through parallel techniques.

Funding Statement: This research was supported by the National Natural Science Foundation of China under Grant No. 61502528.

Conflicts of Interest: We declare that there are no conflicts of interest to report regarding the present study.