A Fast Algorithm for Mining Top-Rank-k Erasable Closed Patterns

2022-08-24 07:01HamNguyenandTuongLe
Computers Materials&Continua 2022年8期

Ham Nguyen and Tuong Le

1Faculty of Information Technology,HUTECH University,Ho Chi Minh City,Vietnam

2Informetrics Research Group,Ton Duc Thang University,Ho Chi Minh City,Vietnam

3Faculty of Information Technology,Ton Duc Thang University,Ho Chi Minh City,Vietnam

Abstract: The task of mining erasable patterns(EPs)is a data mining problem that can help factory managers come up with the best product plans for the future.This problem has been studied by many scientists in recent times,and many approaches for mining EPs have been proposed.Erasable closed patterns (ECPs) are an abbreviated representation of EPs and can be considered condensed representations of EPs without information loss.Current methods of mining ECPs identify huge numbers of such patterns,whereas intelligent systems only need a small number.A ranking process therefore needs to be applied prior to use,which causes a reduction in efficiency.To overcome this limitation,this study presents a robust method for mining toprank-k ECPs in which the mining and ranking phases are combined into a single step.First,we propose a virtual-threshold-based pruning strategy to improve the mining speed.Based on this strategy and dPidset structure,we then develop a fast algorithm for mining top-rank-k ECPs,which we call TRK-ECP.Finally,we carry out experiments to compare the runtime of our TRK-ECP algorithm with two algorithms modified from dVM and TEPUS(Top-rank-k Erasable Pattern mining Using the Subsume concept),which are state-of-the-art algorithms for mining top-rank-k EPs.The results for the running time confirm that TRK-ECP outperforms the other experimental approaches in terms of mining the top-rank-k ECPs.

Keywords: Pattern mining;erasable closed pattern mining;top-rank-k pattern mining

1 Introduction

Frequent pattern mining is one of the most popular topics in data mining and involves extracting frequent itemsets from a database.By identifying frequent patterns,we can observe that some items are strongly correlated together and can easily recognize similar characteristics and associations among them.This topic has attracted a lot of research attention,and many methods have been proposed,such as dEclat [1],FP-Growth * [2],DBV-FI (Dynamic Bit Vector for mining Frequent Itemsets)[3],and NSFI (N-list and Subsume-based algorithm for mining Frequent Itemsets) [4].Frequent itemsets have also been applied to solvethe problem of multi-attribute users under conditions of local differential privacy[5].There are also numerous variations of pattern mining,including frequent closed itemset mining[6],maximal frequent patterns[7],frequent weighted itemset mining[8],utility patterns[9,10],colossal patterns[11],erasable itemset mining[12],and so on.These variations have different meanings and are used in intelligent systems for specific situations.In addition,too many patterns are often generated when traditional approaches are used in intelligent systems,making it time-and resource-consuming to rank the patterns and find the most promising.Several top-kand top-rank-kapproaches have been developed [13-18]for different types of patterns and rules,such as frequent patterns,frequent weighted patterns,closed sequential patterns,and association rules,in order to combine the mining and ranking processes into a single algorithm.This can help intelligent systems to work better.

Erasable pattern mining,developed by Deng et al.[12]in 2009,aims to help manufacturing managers to come up with the best production plans for the coming years.This problem often arises in the product planning process in a factory.In this scenario,a factory produces a wide range of products,each of which is made from certain components(items)and earns a particular amount of money for the manufacturer as profit.The current production plan requires the manufacturer to spend large amounts of money to purchase and store these items.In unexpected situations,such as a financial crisis or the COVID 19 pandemic,the manufacturer cannot afford to purchase all the necessary items as usual;managers therefore need to reconsider their production plans to ensure the stability of the manufacturing process.In the case described above,the problem involves finding itemsets that can be eliminated but do not greatly affect the factory’s profit.In other words,we wish to find the sets of itemsets which can best be eliminated(erased)(called EPs)so that managers can then utilize this knowledge to create a new production plan that minimizes the profit reduction.Numerous algorithms have been proposed to solve the EP mining problem,such as META [12],MEI (mining erasable itemsets)[19],EIFDD(erasable itemsets for very dense datasets)[20],pMEI(parallel mining erasable itemsets)[21],and BREM(bitmap-representation erasable mining)[22].Several variations have also been developed,such as mining EPs with constraints[23],mining erasable closed patterns[24],mining maximal EPs [25],mining top-rank-kEPs [26,27]mining erasable patterns in incremental database[28,29],and mining EPs in data streams[30-33].

When applied to the problem of mining erasable closed patterns(ECPs),the traditional mining approaches proposed in [24]give very large numbers of patterns,and this is the reason for the low efficiency of intelligent systems.These systems mine ECPs by applying traditional algorithms(in the mining phase)and then rank the results to select the top patterns based on their gains(in the ranking phase).This two-phase process is time-and resource-consuming,and such systems may even fail to run due to memory and storage space limitations.Hence,in this paper,we address the problem of mining top-rank-kECPs in order to combine the mining and ranking phases.The key contributions of this study are as follows:(i)we develop a virtual-threshold-based pruning strategy for the mining of toprank-kECPs,in which the virtual threshold is set to the highest threshold of the results in the mining process,thus helping to improve the mining time by avoiding the creation of unsatisfactory candidates;(ii)we present the TRK-ECP algorithm,which uses a virtual-threshold-based pruning strategy for the mining of top-rank-kECPs;(iii)we conduct experiments to demonstrate the effectiveness of the TRKECP algorithm in terms of the mining time,the results of which indicate that TRK-ECP outperforms two alternative modified algorithms(namely dVM and TEPUS)in terms of the mining time for toprank-kECPs.

The rest of this study is structured as follows.An overview of the basic principles of EPs and ECPs is given in Section 2.A definition of the mining of top-rank-k ECPs and a fast-mining algorithm are introduced in Section 3.Our experimental results are presented in Section 4,and it is shown that these confirm the effectiveness of the TRK-ECP algorithm.The conclusions of the study and suggestions for future work are given in Section 5.

2 Basic Principles

This section introduces the basic concepts needed to understand the topic of this study.Section 2.1 presents the definition of EPs and gives examples based on a toy product dataset.Section 2.2 discusses the concept of an ECP and presents some examples.

2.1 Erasable Patterns

Deng et al.[12]introduced an interesting problem called erasable pattern mining,as briefly summarized in this section.We consider a product dataset for a factory,denoted asDB.This dataset containsnproducts,asP={P1,P2,...,Pn}.Each product in this dataset is created from a set of components.For convenience in terms of our study of pattern mining,the set of all components can be considered as the set of all items,denoted byI={i1,i2,...,im}.Each product in the product dataset is represented in the form〈Items,Val〉,in whichItemsare the components required to manufacture this product,andValis the profit that the factory gains by selling this product.A toy product dataset is presented in Tab.1 and is used as an example throughout this study.

Table 1:Toy product dataset

Definition 1.For a product dataset denoted byDBand a threshold ξ,a patternXis an erasable pattern if and only if:

In Eqs.(1)-(3),g(X) is the gain due toX,andTis the total profit of the whole datasetDB.According to Eq.(1),the problem of mining EPs involves finding all patternsXwith gainsg(X)that are less thanT×ξ.

Example 1.For the product dataset which shown in Tab.1,we calculate the total profit asT=We can calculate the gain from itemA5as follows:g(A5)==P2.Val+P3.Val+P4.Val+P5.Val+P6.Val=200+150+50+100+200=$700.For a threshold ξ=50%,A5is an EP,asg(A5)=700<T×ξ=$975.

2.2 Erasable Closed Patterns

In 2017,Vo et al.[24]presented the definition of an ECP,which can be summarized as follows.An EP is called an ECP if and only if there are no supersets that have the same gain.For instance,consider the dataset in Tab.1 with ξ=40%.In this example,A5andA5A3are two EPs,asg(A5)=g(A5A3)=700<T×ξ=$780.In terms of the problem of ECP mining,A5is not an ECP,because we have a superset(A5A3)with the same gain asA5,i.e.,g(A5)=g(A5A3)=$700.Hence,A5A3is an ECP andA5is not.Tab.2 provides all ECPs for our dataset with ξ=40%.

Table 2:ECPs for the example dataset with ξ=40%

2.3 dPidset Structure

In 2014,Le et al.[19]developed the dPidset structure,which is typically used to mine EPs.In this study,we also use this structure to mine top-rank-kECPs.We can summarize the dPidset structure as follows.Firstly,the pidsetp(X)of patternXis determined as:

whereAis an item in patternX,andp(A)is the set of products that containA.

Example 2.For the dataset in Tab.1,the pidsets ofA5,A3,andA6arep(A5)={2,3,4,5,6},p(A3)={3,5},andp(A6)={4,6,8},respectively.We havep(A5A3)=p(A5)∪p(A3)={2,3,4,5,6}.We also havep(A5A6)=p(A5)∪p(A6)={2,3,4,5,6,8}.

Definition 2.LetXAandXBbe two patterns.The dPidset ofXAB,denoted asdP(XAB),is computed by the following equation:

Example 3.In Example 2,the pidsets ofA5,A3,andA6arep(A5)={2,3,4,5,6},p(A3)={3,5},andp(A6)={4,6,8}.Based on Definition 2,we know that the dPidset ofA5A3isdP(A5A3)=p(A3)/p(A5)=Ø.Similarly,the dPidsets ofA5A6andA5A3A6aredP(A5A6)=p(A6)/p(A5)={8},anddP(A5A3A6)=p(A5A6)/p(A5A3)={8},respectively.

Theorem 1[19].LetXAandXBbe two patterns with dPidsetsdP(XA)anddP(XB),respectively.The dPidset ofXABis determined using the following equation:

Example 4.In Example 3,the dPidsets ofA5A3andA5A6aredP(A5A3)=Ø anddP(A5A6)={8},respectively.We havedP(A5A3A6)=dP(A5A6)/dP(A5A3)={8}.The results for Examples 3 and 4 clearly verify Theorem 1.

Theorem 2[19].The gain ofXABcan be computed based on the gain ofXAand the dPidset ofXAB,as follows:

In Eq.(7),g(XA)is the gain ofXAwhilePk.Valis the profit of productPkin the product dataset.

Example 5.For the example dataset in Tab.1,we know that the pidsets ofA5,A3,andA6arep(A5)={2,3,4,5,6},p(A3)={3,5},andp(A6)={4,6,8},respectively.We also haveg(A5)=$700,g(A3)=$250,andg(A6)=$350.SincedP(A5A3)=Ø,we haveg(A5A3)=g(A5)+.Val=$700.Then,sincedP(A5A6)={8},we haveg(A5A6)=g(A5)+.Val=$800.Finally,sincedP(A5A3A6)={8},we haveg(A5A3A6)=g(A5A3)+.Val=$800.

3 TRK-ECP:A New Algorithm for Mining Top-Rank-k ECPs

3.1 The Problem of Mining Top-Rank-k ECPs

Definition 3.A patternXis in the top-rank-kECPs if and only ifr(X)≤k,where the rank of an ECPXis determined by the following equation:

The problem of mining the top-rank-kECPs is therefore the task of finding the complete set of ECPs for which the rank is no greater thank,wherekis input by the user.

Example 5.For the dataset in Tab.1,letkbe five.Tab.3 below shows all ECPs belonging to the top-5 ECPs from our dataset.We call these the top-rank-5 ECPs.

Table 3:Top-rank-5 ECPs

3.2 TRK-ECP Algorithm

In this section,we present a virtual-threshold-based pruning strategy to accelerate the runtime for mining top-rank-kECPs.We first consider a top-rank-ktable structure calledTR,which is used to store the current top-rank-kECPs.In this table,ranking is done based on ascending order of gain,meaning that the last rank inTRis the largest gain.Hence,whenTRreceiveskranks(i.e.,there arekranks inTR),the virtual threshold (ξ) will be updated based on the gain of the last rank inTR.During the candidate generation procedure,the algorithm will not create a candidate from two ECPs if either of them has a gain greater than the virtual threshold.This strategy helps to reduce the search space and accelerates the runtime,and we refer to this as our virtual-threshold-based pruning strategy.Secondly,we develop the TRK-ECP algorithm based on this strategy,as shown in Algorithm 1.

Example 6.To illustrate the operation of the TRK-ECP algorithm,we apply it to the example dataset in Tab.1,withk=5.In Lines 1-3,the proposed algorithm scans the example dataset to determine the gain and dPidset for each 1-item stored inE1,as shown in Tab.4.Note thatE1is the set of all items in the dataset,without a threshold.

Algorithm 1:TRK-ECP algorithm Input:A product dataset DB and a threshold k Output:TR(the top-rank-k ECPs)1.Let TR ←Ø and Ck ←Ø 2.Scan DB to determine the gain and dPidset for each item,and store them in E1 3.Sort E1 in ascending order of gain 4.Initialize TR from E1 5.Let ξ be g(TR.last_entry)//update ξ based on the gain of the last entry 6.While Ck.count>1 do 7.C ←Candidate_Generation(Ck)8.Sort C in ascending order of gain 9.Ck ←Ø 10.For each c in C do 11.Let ECPs be all the ECPs with rank c in TR 12.For each e in ECPs do 13.If e is a subset of c then 14.Remove ECP from ECPs 15.Insert c into ECPs and update ECPs in TR 16.Insert c into Ck 17.If TR.count>k then 18.Remove the last tuple from TR 19.Let ξ be g(TR.last_tuple)//update ξ based on the gain of the last entry Procedure Candidate_Generation(Ck)1.Let Cnext ←Ø 2.For each cu ∈Ck do 3.For each cv ∈Ck with u<v do 4.If g(cu)>ξ or g(cv)>ξ then 5.Continue;//using the pruning strategy based on a virtual threshold(Continued)

Algorithm 1:Continued 6.If cu and cv are in an equivalence class then 7.c.dPidset=cv.dPidsetcu.dPidset 8.g(c)=g(cu)+∑Pj ∈cv.dPidsetPj.Val 9.If g(c)<=ξ then 10 c=cu ∪cv 11.Add c to Cnext 12.Return Cnext

Table 4:Gains and dPidsets for E1 in the example dataset

In Lines 4-10,the TRK-ECP algorithm inserts the first five items ofE1--,{A3,A6,A7,A4,A5},into the results for the top-rank-kECPs,denoted byTR,and the next-level candidatesCk(which are used to create the next-level candidates).Note thatA1andA2are eliminated,since the results have alreadyk=5 elements.The number of ranks in the results(TR)is five,and the threshold for this step ξ is set to 700.The results after the first step are presented in Tab.5.

Table 5:Top five ranked ECPs after the first step

In the next step (Lines 11-24),TRK-ECP utilizesCkto create the next-level candidates {〈A3A6,600〉,〈A6A7,500〉,〈A6A4,600〉,〈A7A4,600〉},using the Candidate_Generation procedure.Note that{〈A3A7,700〉,〈A3A4,750〉,〈A3A5,700〉,〈A6A5,800〉,〈A7A5,950〉,〈A4A5,950〉} are not generated by the virtual-threshold-based pruning strategy.The TRK-ECP algorithm then sorts the candidates in ascending order of gain,and thus we have the list of candidates{〈A6A7,500〉,〈A3A6,600〉,〈A6A4,600〉,〈A7A4,600〉}.Next,these candidates will be inserted intoTRandCk.This step also removes any non-ECPs that have been added to the results.In our example,this step inserts a new rank with a gain of 500,which includes an ECP{〈A6A7,500〉},while the rank for a gain of 700 is removed fromTR.At a rank of 600,this algorithm removes{A4}from the result,as it is not an ECP.The virtual threshold ξ is set to 600.The results of this step are presented in Tab.6.

Table 6:Top five ranked ECPs after the second step

In the third step,the TRK-ECP algorithm usesCk={〈A6A7,500〉,〈A3A6,600〉,〈A6A4,600〉,〈A7A4,600〉}to create〈A6A7A4,600〉,and inserts〈A6A7A4,600〉intoTRandCk.In this step,{A6A4,A7A4}is a subset ofA6A7A4and has the same gain of 600.The algorithm therefore removes{A6A4,A7A4}from the results.Tab.7 presents the results after this step.

Table 7:Top five ranked ECPs after the third step

In the next step,the set of candidatesCk={〈A6A7A4,600〉}contains only one item;the TRK-ECP algorithm therefore stops,and the final results are as shown in Tab.7.

4 Experiments

The experiments described in this section were conducted on a computer with an Intel Core i5-7200U 2.5 GHz CPU and 16 GBs of RAM.All the experimental algorithms were implemented in C#and were run in the same environment with the.Net Framework Version 4.5.Four public datasets(Chess,Connect,Mushroom,and T10I4D100K) were used to test the effectiveness of the proposed approach.These datasets are frequently used to evaluate pattern mining algorithms in relation to data mining tasks.To create the product datasets,a column generated using a random function was added to store the profit for each product.The features of these datasets are given in Tab.8.

To verify the effectiveness of the TRK-ECP approach,we compare its runtime with those of TEPUS[27]and dVM[26]for the problem of mining the top-rank-kECPs(denoted as TEPUS4ECP and dVM4ECP).Note that there are currently no alternative algorithms for mining the top-rank-kECPs,and we therefore had to modify TEPUS and dVM accordingly.TEPUS4ECP and dVM4ECP were created in two stages,as follows.In the first step,we used TEPUS and dVM to mine the toprank-kEPs.The top-rank-kECPs were obtained in the second step,based on the top-rank-kEPs.The runtimes for TEPUS4ECP and dVM4ECP consist of the sums of the runtimes for the first and second steps.

Table 8:Features of the experimental datasets

In Fig.1,we compare the runtimes of TKR-ECP,TEPUS4ECP and dVM4ECP on the Chess dataset.The results show that TKR-ECP is the fastest and dVM4ECP is the slowest.For example,fork=400,the runtimes for TKR-ECP,TEPUS4ECP,and dVM4ECP are 0.159,0.24,and 1.42 s,respectively.Thus,TKR-ECP is 33.75%times faster than TEPUS4ECP and 88.8%times faster than dVM4ECP on the Chess dataset,for k=400.The total times fork=100,200,300,and 400 for TKRECP,TEPUS4ECP,and dVM4ECP are 0.608,0.84 and 4.15 s.

Figure 1:Runtimes for three experimental methods on the chess dataset

Fig.2 compares the runtimes of TKR-ECP,TEPUS4ECP and dVM4ECP on the Connect dataset.The results show that TKR-ECP is the fastest,and dVM4ECP is the slowest.For example,fork=300,the runtimes for TKR-ECP and TEPUS4ECP are 4.16 and 6.28 s,while dVM4ECP cannot run at this threshold.The total times fork=100,200,300,and 400 for TKR-ECP and TEPUS4ECP are 14.88 and 21.97 s.Hence,TKR-ECP is 33.75%times faster than TEPUS4ECP on this dataset.

Fig.3 compares the runtimes of TKR-ECP,TEPUS4ECP and dVM4ECP for the Mushroom dataset.The results show that the times for all three algorithms are the same for all thresholdsk.The total times fork=100,200,300,and 400 for TKR-ECP,TEPUS4ECP,and dVM4ECP on the Mushroom dataset are 1.55,1.98 and 10.32 s,respectively.

Figure 2:Runtimes for three experimental methods on the connect dataset

Figure 3:Runtimes for three experimental methods on the mushroom dataset

Fig.4 compares the runtimes of TKR-ECP,TEPUS4ECP and dVM4ECP on the T10I4D100K dataset.The results show that TKR-ECP is the fastest and dVM4ECP is the slowest in terms of runtime.For example,fork=200,the runtimes for TKR-ECP,TEPUS4ECP,and dVM4ECP on the T10I4D100K dataset are 3.58,3.64,and 5.36 s,respectively,meaning that TKR-ECP is 1.6% times faster than TEPUS4ECP and 33%times faster than dVM4ECP for the T10I4D100K dataset with this threshold.The total times fork=100,200,300,and 400 for TKR-ECP,TEPUS4ECP,and dVM4ECP for this dataset are 15.6,16.22 and 29.32 s,respectively.

To test whether our method was the best,we conducted paired t-test statistics for the experimental methods for all four datasets.As we can see from Tab.9,most of thep-values for Chess,Connect and Mushroom were less than 0.05,meaning that our proposed method outperformed dVM4ECP and TEPUS4ECP on these datasets.For T10I4D100K,the differences between the experimental methods were not obvious.

Figure 4:Runtimes for three experimental methods on the T10I4D100K dataset

Table 9: p-values for paired t-test statistics

From the experimental results presented above,we can see that TKR-ECP was the fastest,and dVM4ECP was the slowest.TKR-ECP was faster than TEPUS4ECP for three databases (Chess,Connect,and Mushroom),and the performance of these two methods was equal on the remaining database(T10I4D100K).In addition,the memory usage for TKR-ECP and TEPUS4ECP was roughly equivalent on these experimental databases.Overall,the proposed algorithm outperformed the other experimental approaches for mining top-rank-kECPs in terms of processing time.

5 Conclusion

In this study,we first introduced the concept of mining the top-rank-kECPs,and a fast method for mining top-rank-kECPs was then presented.Our new algorithm combines the mining and ranking of ECPs into a single algorithm and can therefore help to improve the processing time of intelligent systems.We developed a virtual-threshold-based pruning strategy to improve the mining speed for this task and applied this strategy with the dPidset structure to devise the TRK-ECP algorithm for mining the top-rank-kECPs.We have presented detailed,step-by-step examples of how our algorithm works,and conducted experiments to compare the runtime of the proposed algorithm with two modified algorithms (dVM4ECP and TEPUS4ECP) for the mining of top-rank-kECPs.Our experimental results confirm that the proposed algorithm outperforms the other experimental approaches for the mining of top-rank-kECPs in terms of processing time.

In the future,several topics related to the problem of mining EPs and ECPs in incremental databases and data streams will be studied.Moreover,we intend to focus on the problem of mining EPs and ECPs with certain kinds of constraints.

Funding Statement:The authors received no specific funding for this study.

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.