A Parallel Approach to Discords Discovery in Massive Time Series Data

Computers, Materials & Continua, 2021, Issue 2

Mikhail Zymbler, Alexander Grents, Yana Kraeva and Sachin Kumar

Department of Computer Science, South Ural State University, Chelyabinsk, 454080, Russia

Abstract: A discord is a refinement of the concept of an anomalous subsequence of a time series. Being one of the topical issues of time series mining, discords discovery is applied in a wide range of real-world areas (medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, etc.). In this article, we propose a novel parallel algorithm for discords discovery on a high-performance cluster with nodes based on many-core accelerators, for the case when the time series cannot fit in the main memory. We assume that the time series is partitioned across the cluster nodes, and we achieve parallelization both among the cluster nodes and within a single node. Within a cluster node, the algorithm employs a set of matrix data structures to store and index the subsequences of a time series, and to provide an efficient vectorization of computations on the accelerator. Each node processes its own partition in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the partition. The local discords found are then combined into the global candidate set and transmitted to each cluster node. Next, each node refines the global candidate set over its own partition, resulting in a local true discord set. Finally, the global true discord set is constructed as the intersection of the local true discord sets. The experimental evaluation on a real computer cluster with real and synthetic time series shows the high scalability of the proposed algorithm.

Keywords: Time series; discords discovery; computer cluster; many-core accelerator; vectorization

1 Introduction

Currently, the discovery of anomalous subsequences in a very long time series is a topical issue in a wide spectrum of real-world applications, namely medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, and others. Such applications have to deal with multi-terabyte time series, which cannot fit into the main memory.

Keogh et al. [1] introduced HOTSAX, an anomaly detection algorithm based on the discord concept. A discord of a time series can informally be defined as a subsequence that has the largest distance to its nearest non-self match neighbor subsequence. A discord is attractive as an anomaly detector because it requires only one intuitive parameter (the subsequence length), as opposed to most anomaly detection algorithms, which typically require many parameters [2]. HOTSAX, however, assumes that the time series resides in main memory.

Further, Yankov, Keogh et al. proposed a disk-aware algorithm (for brevity, referred to as DADD, Disk-Aware Discord Discovery) based on the range discord concept [3]. For a given range $r$, DADD finds all discords at a distance of at least $r$ from their nearest neighbor. The algorithm requires two linear scans through the time series on disk. Later, Yankov, Keogh et al. [4] discussed a parallelization of DADD based on the MapReduce paradigm. However, in the experimental evaluation, the authors only simulated the above-mentioned parallel implementation on up to eight computers.

Our research is devoted to parallel and distributed algorithms for time series mining. In previous work [5], we parallelized HOTSAX for many-core accelerators. This article continues our study and contributes as follows. We present a parallel implementation of DADD on a high-performance cluster with nodes based on many-core accelerators. The original algorithm is extended by a set of index structures to provide an efficient vectorization of computations on each cluster node. We carried out experiments on a real computer cluster with real and synthetic time series, which showed the high scalability of our approach.

The rest of the article is organized as follows. In Section 2, we give the formal definitions along with a brief description of DADD. Section 3 provides a brief review of the state of the art. Section 4 presents the proposed methodology. Section 5 provides the results of the experimental evaluation of our approach. Finally, Section 6 summarizes the results obtained and suggests directions for further research.

2 Problem Statement and the Serial Algorithm

2.1 Notations and Definitions

Below, we follow [4] to give some formal definitions and the statement of the problem.

A time series $T$ is a sequence of real-valued elements: $T = (t_1, \ldots, t_m)$, $t_i \in \mathbb{R}$. The length of a time series is denoted by $|T|$.

A subsequence $T_{i,n}$ of a time series $T$ is its contiguous subset of $n$ elements that starts at position $i$: $T_{i,n} = (t_i, \ldots, t_{i+n-1})$, $1 \leq n \ll m$, $1 \leq i \leq m-n+1$. We denote the set of all subsequences of length $n$ in $T$ by $S_T^n$, and let $N = m - n + 1$ denote the number of subsequences in $S_T^n$.

A distance function for any two subsequences is a nonnegative and symmetric function $\mathrm{Dist}: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$.

Two subsequences $T_{i,n}, T_{j,n} \in S_T^n$ are non-trivial matches [6] with respect to a distance function Dist if $|i - j| \geq n$. Let us denote a non-trivial match of a subsequence $C \in S_T^n$ by $M_C$.

A subsequence $D \in S_T^n$ is said to be the most significant discord in $T$ if the distance to its nearest non-trivial match is the largest. That is, $\forall C \in S_T^n: \min(\mathrm{Dist}(D, M_D)) \geq \min(\mathrm{Dist}(C, M_C))$.

A subsequence $D \in S_T^n$ is called the most significant $k$-th discord in $T$ if the distance to its $k$-th nearest non-trivial match is the largest.

Given a positive parameter $r$, a discord at a distance of at least $r$ from its nearest neighbor is called a range discord, i.e., for a discord $D$: $\min(\mathrm{Dist}(D, M_D)) \geq r$.

DADD, the original serial disk-based algorithm [4], addresses the discovery of range discords and provides researchers with a procedure to choose the $r$ parameter. To accelerate this procedure, our parallel algorithm [5] for many-core accelerators can be applied; it discovers discords for the case when the time series fits in the main memory.

When computing the distance between subsequences, DADD demands that the arguments have been previously z-normalized to have a mean of zero and a standard deviation of one. Here, the z-normalization of a subsequence $C = (c_1, \ldots, c_n) \in S_T^n$ is defined as a subsequence $\hat{C} = (\hat{c}_1, \ldots, \hat{c}_n)$ in which

$$\hat{c}_i = \frac{c_i - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} c_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} c_i^2 - \mu^2.$$
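
For illustration, here is a minimal C++ sketch of the z-normalization defined above; the function and variable names are ours, not part of the original implementation:

#include <cmath>
#include <vector>

// Z-normalize a subsequence: shift by its mean, scale by its standard
// deviation, as in the definition above.
std::vector<double> znormalize(const std::vector<double>& c) {
    const size_t n = c.size();
    double mu = 0.0, sq = 0.0;
    for (double v : c) { mu += v; sq += v * v; }
    mu /= n;
    const double sigma = std::sqrt(sq / n - mu * mu);
    std::vector<double> out(n);
    for (size_t i = 0; i < n; ++i)
        out[i] = (c[i] - mu) / sigma;  // assumes sigma > 0 (non-constant subsequence)
    return out;
}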

In the original DADD algorithm, the Euclidean distance is used as the distance measure, yet the algorithm can be utilized with any distance function, which may not necessarily be a metric [4]. Given two subsequences $X, Y \in S_T^n$, the Euclidean distance between them is calculated as

$$\mathrm{ED}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$

2.2 The Original Algorithm

The DADD algorithm [4] performs in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the time series on disk. Algorithm 1 depicts the pseudo-code of DADD (up to the replacement of the Euclidean distance by an arbitrary distance function). The algorithm takes a time series $T$ and a range $r$ as input, and outputs the set of discords C. For a discord $c \in$ C, we denote the distance to its nearest neighbor as $c.dist$.

In the first phase, the algorithm scans through the time series $T$, and for each subsequence $s \in S_T^n$ it validates the possibility for each candidate $c$ already in the set C to be a discord. If a candidate $c$ fails the validation, then it is removed from the set. In the end, the new $s$ is either added to the candidate set, if it is likely to be a discord, or discarded. The correctness of this procedure is proved in Yankov et al. [4].

In the second phase, the algorithm initially sets the distances of all candidates to their nearest neighbors to infinity. Then, the algorithm scans through the time series $T$, calculating the distance between each subsequence $s \in S_T^n$ and each candidate $c$. Here, when calculating $\mathrm{ED}(s, c)$, the EarlyAbandonED procedure stops the summation $\sum_{k=1}^{n}(s_k - c_k)^2$ if it reaches $k = \ell$, $1 \leq \ell \leq n$, for which $\sum_{k=1}^{\ell}(s_k - c_k)^2 > r^2$. If the distance is less than $r$, then the candidate is a false positive and is permanently removed from C. If this distance is less than the current value of $c.dist$ (and still greater than $r$; otherwise the candidate would have been removed), then the current distance to the nearest neighbor is updated.
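
The early abandoning idea can be sketched in C++ as follows; this is an illustrative fragment assuming the squared threshold $r^2$ is precomputed, and the names are hypothetical:

#include <vector>

// Accumulate the squared Euclidean distance and stop as soon as the partial
// sum exceeds r^2: the remaining terms can only increase it. Working with
// squared values also avoids the square root.
double early_abandon_ed2(const std::vector<double>& s,
                         const std::vector<double>& c,
                         double r2) {
    double sum = 0.0;
    for (size_t k = 0; k < s.size(); ++k) {
        const double d = s[k] - c[k];
        sum += d * d;
        if (sum > r2) return sum;  // early abandon: distance already exceeds r
    }
    return sum;
}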

3 Related Work

Introduced by Keogh et al. [1], time series discords are currently considered one of the best techniques for time series anomaly detection [7].

The original HOTSAX algorithm [1] is based on the SAX (Symbolic Aggregate approXimation) transformation [8]. Among the improvements of HOTSAX, we can mention the following algorithms: iSAX [9] and HOT-iSAX [10] (indexable SAX), WAT [11] (Haar wavelets instead of SAX), HashDD [12] (a hash table instead of the prefix trie), HDD-MBR [13] (application of R-trees), and BitClusterDiscord [14] (clustering of the bit representation of subsequences). However, the above-mentioned algorithms are able to discover discords only if the time series fits in the main memory, and, to the best of our knowledge, have no parallel implementations.

Further, Yankov, Keogh et al. [3] overcame the main-memory size limitation by proposing the disk-aware discord discovery algorithm (DADD) based on the range discord concept. For a given range $r$, DADD finds all discords at a distance of at least $r$ from their nearest neighbor. The algorithm performs in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the time series on disk.

There are a couple of noteworthy works devoted to the parallelization of DADD. The DDD (Distributed Discord Discovery) algorithm [15] parallelizes DADD through a Spark cluster [16] and HDFS (Hadoop Distributed File System) [17]. DDD distributes the time series over the HDFS cluster and handles each partition in the memory of a computing node. As opposed to DADD, DDD computes the distance without taking advantage of an upper bound for early abandoning, which would increase the algorithm's performance.

The PDD (Parallel Discord Discovery) algorithm [18] also utilizes a Spark cluster, but transmits a subsequence and its non-trivial matches to one or more computing nodes to calculate the distance between them. A bulk of contiguous subsequences is transmitted and processed in batch mode to reduce the message-passing overhead. PDD is not scalable, since intensive message passing between the cluster nodes leads to a significant degradation of the algorithm's performance as the number of nodes increases.

In their further work [4], Yankov, Keogh et al. discussed a parallel version of DADD based on the MapReduce paradigm (hereinafter referred to as MR-DADD); the basic idea is as follows. Let the input time series $T$ be partitioned evenly across $P$ cluster nodes. Each node performs the selection phase on its own partition with the same $r$ parameter and produces a distinct candidate set $C_i$. Then the combined candidate set $C^P$ is constructed as $C^P = \bigcup_{i=1}^{P} C_i$ and transmitted to each cluster node. Next, a node performs the refinement phase on its own partition taking $C^P$ as input, and produces the refined candidate set $C_i$. The final discords are given by the set $C^P = \bigcap_{i=1}^{P} C_i$. In the experimental evaluation, however, the authors only simulated the above-mentioned scheme on up to eight computers, resulting in a near-to-linear speedup.

Concluding this brief review, we should also mention the matrix profile (MP) concept proposed by Keogh et al. [19]. MP is a data structure that annotates a time series, and can be applied to solve an impressively large list of time series mining problems, including discords discovery, but at a computational cost of $O(m^2)$, where $m$ is the time series length [19,20]. Recent parallel algorithms for MP computation include GPU-STAMP [19] and MP-HPC [21], which are implementations for graphics processors through the CUDA (Compute Unified Device Architecture) technology and for computer clusters through the MPI (Message Passing Interface) technology, respectively.

4 Discords Discovery on Computer Cluster with Many-Core Accelerators

The parallelization employs two levels of parallelism, namely across the cluster nodes and among the threads of a single node. We implemented these levels through partitioning of the input time series and the MPI technology, and through the OpenMP technology, respectively. Within a single node, we employ a matrix representation of the data to effectively parallelize computations through OpenMP. Below, we show an approach to the implementation of these ideas.

4.1 Time Series Representation

To provide parallelism at the level of the cluster nodes, we partition the time series across the nodes as follows. Let $P$ be the number of nodes in the cluster; then the $k$-th partition ($0 \leq k \leq P-1$) of the time series is defined as $T_{start,\,len}$, where

$$start = \begin{cases} 1, & k = 0,\\ k \cdot \lfloor m/P \rfloor - n + 2, & k > 0, \end{cases} \qquad len = \begin{cases} \lfloor m/P \rfloor, & k = 0,\\ \lfloor m/P \rfloor + n - 1, & 0 < k < P-1,\\ m - start + 1, & k = P-1. \end{cases}$$

This means that the head part of every partition except the first overlaps with the tail part of the previous partition in $n-1$ data points. Such a technique prevents the loss of resulting subsequences at the junctions of two neighboring partitions. To simplify the presentation of the algorithm, hereinafter in this section we use the symbol $T$ and the above-mentioned related notions to refer to the partition on the current node rather than to the whole input time series.
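
A possible C++ implementation of this overlapped partitioning, assuming the start/len formulas reconstructed above (1-based positions, as in the paper's notation; names are illustrative):

#include <cstddef>

struct Partition { size_t start; size_t len; };

// Compute the k-th partition of a time series of length m split over P nodes:
// every partition except the first starts n-1 points earlier than its own
// chunk, so no subsequence crossing a junction is lost. Assumes m/P > n.
Partition partition(size_t m, size_t n, size_t P, size_t k) {
    const size_t chunk = m / P;                             // floor(m / P)
    const size_t start = (k == 0) ? 1 : k * chunk - n + 2;  // n-1 point overlap
    const size_t end   = (k == P - 1) ? m : (k + 1) * chunk; // last node takes the tail
    return { start, end - start + 1 };
}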

The time series partition is stored as a matrix of aligned subsequences to enable computations over aligned data with as many auto-vectorizable loops as possible. We avoid unaligned memory access, since it can cause inefficient vectorization due to the time overhead of loop peeling [22].

Let us denote the number of floats stored in the VPU (vector processing unit) of the many-core accelerator by $widthVPU$. If the discord length $n$ is not a multiple of $widthVPU$, then each subsequence is padded with zeros, where the number of zeros is calculated as $pad = widthVPU - (n \bmod widthVPU)$. Thus, the aligned (and previously z-normalized) subsequence $\tilde{T}_{i,n}$ is defined as follows:

$$\tilde{T}_{i,n} = (\hat{t}_i, \ldots, \hat{t}_{i+n-1}, \underbrace{0, \ldots, 0}_{pad}).$$

The subsequence matrix is defined as $S_T^n \in \mathbb{R}^{N \times (n+pad)}$, whose $i$-th row is the aligned subsequence: $S_T^n(i, \cdot) = \tilde{T}_{i,n}$.
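
The following C++ sketch shows one way to build such a matrix, combining z-normalization and zero padding; it is an illustration under our assumptions (row-major layout, float elements), not the authors' code:

#include <cmath>
#include <vector>

// Build the N x (n + pad) subsequence matrix: each z-normalized subsequence
// occupies one row, zero-padded so the row width is a multiple of the VPU
// width (e.g., 16 floats for a 512-bit VPU).
std::vector<float> build_subsequence_matrix(const std::vector<float>& t,
                                            size_t n, size_t width_vpu) {
    const size_t N = t.size() - n + 1;
    const size_t pad = (n % width_vpu == 0) ? 0 : width_vpu - n % width_vpu;
    const size_t row = n + pad;
    std::vector<float> S(N * row, 0.0f);  // zero padding comes for free
    for (size_t i = 0; i < N; ++i) {
        double mu = 0.0, sq = 0.0;        // z-normalize T_{i,n} into row i
        for (size_t j = 0; j < n; ++j) { mu += t[i + j]; sq += t[i + j] * t[i + j]; }
        mu /= n;
        const double sigma = std::sqrt(sq / n - mu * mu);
        for (size_t j = 0; j < n; ++j)
            S[i * row + j] = static_cast<float>((t[i + j] - mu) / sigma);
    }
    return S;
}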

4.2 Internal Data Layout

The parallel algorithm employs the data structures depicted in Fig. 1. When defining the structures that store data in the main memory of a cluster node, we suppose that each structure is shared by all threads the algorithm is running on, and that each thread processes its own data segment independently. Let us denote the number of threads employed by the algorithm on a cluster node by $p$, and let $iam$ ($0 \leq iam \leq p-1$) denote the number of the current thread.

Figure 1: Data layout of the algorithm

The set of discords C is implemented as an object with two basic attributes, namely the candidate index and the candidate body, which store the indices of all potential discord subsequences and their values themselves, respectively.

Let us denote the ratio of the number of candidates selected at a cluster node to the number of all subsequences of the time series by $\xi$. The exact value of the $\xi$ parameter is a subject of empirical choice. In our experiments, $\xi = 0.01$ was enough to store all candidates. Thus, we denote the number of candidates as $L = \xi \cdot N$ and assume that $L \ll N$.

The candidate index is organized as a matrix $C.index \in \mathbb{N}^{p \times L}$, which stores the indices of candidates in the subsequence matrix $S_T^n$ found by each thread, i.e., the $i$-th row keeps the indices of the candidates that have been found by the $i$-th thread. Initially, the candidate index is filled with NULL values.

To provide fast access to the candidate index during the selection phase, it is implemented as a deque (double-ended queue) with three attributes, namely count, head, and tail. The deque count is an array $C.count \in \mathbb{N}^p$, which for each thread keeps the number of non-NULL elements in the respective row of the candidate index matrix. The deque head and tail are arrays $C.head$ and $C.tail \in \mathbb{N}^p$, respectively, which represent second-level indices that for each thread keep the number of the column in $C.index$ with the most recent NULL value and with the least recent non-NULL value, respectively.
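
As an illustration, the layout described above might be declared as follows in C++; the field names follow the paper's notation, while the concrete types and the NONE sentinel standing for NULL are our assumptions:

#include <cstddef>
#include <vector>

constexpr size_t NONE = static_cast<size_t>(-1);  // stands for NULL in C.index

// Per-thread candidate index organized as a deque: rows of `index` hold
// candidate positions, while count/head/tail track the occupied slots.
struct CandidateSet {
    size_t p, L;
    std::vector<size_t> index;  // p x L matrix: index[i * L + j], NONE if empty
    std::vector<size_t> count;  // per-thread number of non-NULL entries
    std::vector<size_t> head;   // column of the most recent NULL value
    std::vector<size_t> tail;   // column of the least recent non-NULL value
    CandidateSet(size_t p_, size_t L_)
        : p(p_), L(L_), index(p_ * L_, NONE),
          count(p_, 0), head(p_, 0), tail(p_, 0) {}
};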

Let $H$ ($H < L \ll N$) be the number of candidates selected at a cluster node during the algorithm's first phase. Then the candidate body is the matrix $C.cand \in \mathbb{R}^{H \times n}$, which represents the candidate subsequences themselves. The candidate body is accompanied by an array $C.pos \in \mathbb{N}^H$, which stores the starting positions of the candidate subsequences in the input time series.

After the selection phase, all the nodes exchange the candidates found to construct the combined candidate set, so that at each cluster node the candidate body contains the potential discords from all the nodes. In the second phase, the algorithm refines the combined candidate set by comparing the parameter $r$ with the distances between each element of the candidate body and each element of the subsequence matrix.

To parallelize this activity, we process the rows of the subsequence matrix in a segment-wise manner and employ an additional attribute of the candidate body, namely the bitmap. The bitmap is organized as a matrix $C.bitmap \in \mathbb{B}^{p \times H}$, which indicates whether an element of the candidate body has been successfully validated against all elements in a segment of the subsequence matrix. Thus, after the algorithm's second phase, the $i$-th element of the candidate body is successfully validated if $\bigwedge_{\ell=1}^{p} C.bitmap(\ell, i)$ is true.

4.3 Parallel Implementation of the Algorithm

In the implementation, we apply the following parallelization scheme at the level of the cluster nodes. Let the input time series $T$ be partitioned evenly across $P$ cluster nodes. Each node performs the selection phase on its own partition with the same threshold parameter $r$ and produces a distinct candidate set $C_i$.

Next, as opposed to MR-DADD [4], each node refines its own candidate set $C_i$ with respect to the $r$ value. Indeed, a candidate cannot be a true discord if it is pruned in the refinement phase on at least one cluster node. Thus, by this local refinement procedure, we try to reduce each candidate set $C_i$ and, in turn, the combined candidate set $C^P = \bigcup_{i=1}^{P} C_i$. In the experiments, this allowed us to reduce the size of the combined candidate set severalfold.

Then the combined candidate set $C^P$ is constructed and transmitted to each cluster node. Next, a node refines $C^P$ over its own partition and produces the result $C_i$. Finally, the true discord set is constructed as $C = \bigcap_{i=1}^{P} C_i$.

The parallel implementation of the candidate selection and refinement phases is depicted in Algorithm 2 and Algorithm 3, respectively. To speed up the computations at a cluster node, we omit the square root calculation: since the square root is a monotonic function, comparing squared distances preserves the relative ranking of the candidates.

Algorithm 2: Parallel Candidate Selection (in: T, r; out: C)
 1: #pragma omp parallel
 2:   iam ← omp_get_thread_num()
 3:   #pragma omp for
 4:   for i from 1 to N do
 5:     isCand ← TRUE
 6:     for j from 1 to C.tail(iam) do
 7:       if C.index(iam, j) = NULL or |C.index(iam, j) − i| < n then
 8:         continue
 9:       if ED²(S_T^n(i, ·), S_T^n(C.index(iam, j), ·)) < r² then
10:         isCand ← FALSE; C.count(iam) ← C.count(iam) − 1
11:         C.index(iam, j) ← NULL; C.head(iam) ← j
12:     if isCand then
13:       C.count(iam) ← C.count(iam) + 1
14:       if C.index(iam, C.head(iam)) = NULL then
15:         C.index(iam, C.head(iam)) ← i
16:       else
17:         C.index(iam, C.tail(iam)) ← i; C.tail(iam) ← C.tail(iam) + 1
18: return C

In the selection phase, we parallelize the outer loop over the rows of the subsequence matrix, while in the inner loop over the candidates, each thread processes its own segment of the candidate index. By the end of the phase, the candidates found by each thread are placed into the candidate body, and all the cluster nodes exchange the resulting candidate bodies via the MPI_Send and MPI_Recv functions to form the combined candidate set, which serves as the input for the second phase; a sketch of such an exchange is given below.
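
The paper performs the exchange with MPI_Send and MPI_Recv; for brevity, this illustrative C++ sketch uses the equivalent collective MPI_Allgatherv, which gathers every node's candidate body on all nodes (names and the flat float layout are our assumptions):

#include <mpi.h>
#include <vector>

// Form the combined candidate set C^P: gather all nodes' candidate bodies
// on every node. `local` holds this node's candidates as H_local rows of
// rowWidth = n + pad floats each, flattened row-major.
std::vector<float> exchange_candidates(const std::vector<float>& local,
                                       int rowWidth, MPI_Comm comm) {
    int P;
    MPI_Comm_size(comm, &P);
    const int mine = static_cast<int>(local.size());
    std::vector<int> counts(P), displs(P);
    MPI_Allgather(&mine, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);
    int total = 0;
    for (int i = 0; i < P; ++i) { displs[i] = total; total += counts[i]; }
    std::vector<float> combined(total);
    MPI_Allgatherv(local.data(), mine, MPI_FLOAT,
                   combined.data(), counts.data(), displs.data(), MPI_FLOAT,
                   comm);
    return combined;  // H = total / rowWidth candidate rows on every node
}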

In the refinement phase, we also parallelize the outer loop over the rows of the subsequence matrix, and in the inner loop over the candidates, each thread processes its own segments of the candidate body and bitmap. In this implementation, we do not use the early abandoning technique for the distance calculation, relying on the fact that vectorization of the squared Euclidean distance may give more benefit. By the end of the phase, the column-wise conjunction of the elements of the bitmap matrix results in the set of true discords found by the current cluster node. The intersection of such sets is computed by one of the cluster nodes, to which the rest of the nodes send their resulting sets.

Algorithm 3: Parallel Discord Refinement (in: T, r; in out: C)
 1: C.bitmap ← TRUE^{p × H}
 2: #pragma omp parallel
 3:   iam ← omp_get_thread_num()
 4:   #pragma omp for
 5:   for i from 1 to N do
 6:     for j from 1 to H do
 7:       C.bitmap(iam, j) ← C.bitmap(iam, j) and (ED²(S_T^n(i, ·), C.cand(j, ·)) ≥ r²)
 8: return C
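
For illustration, a C++/OpenMP sketch of the refinement phase, including the final column-wise conjunction of the bitmap, might look as follows; self-match exclusion and other details of the authors' implementation are omitted, and the names are ours:

#include <omp.h>
#include <vector>

// Validate every candidate against this node's partition. S is the N x rowW
// subsequence matrix, cand the H x rowW candidate body, both flattened
// row-major; r2 is the squared range. Each thread updates its own bitmap row,
// then the per-thread verdicts are AND-ed per candidate.
std::vector<char> refine(const std::vector<float>& S, size_t N,
                         const std::vector<float>& cand, size_t H,
                         size_t rowW, float r2) {
    const int p = omp_get_max_threads();
    std::vector<char> bitmap(static_cast<size_t>(p) * H, 1);
    #pragma omp parallel
    {
        const int iam = omp_get_thread_num();
        #pragma omp for
        for (long long i = 0; i < static_cast<long long>(N); ++i)
            for (size_t j = 0; j < H; ++j) {
                float sum = 0.0f;
                for (size_t k = 0; k < rowW; ++k) {  // auto-vectorizable loop
                    const float d = S[i * rowW + k] - cand[j * rowW + k];
                    sum += d * d;
                }
                bitmap[iam * H + j] &= (sum >= r2);
            }
    }
    std::vector<char> verdict(H, 1);                 // column-wise conjunction
    for (int t = 0; t < p; ++t)
        for (size_t j = 0; j < H; ++j) verdict[j] &= bitmap[t * H + j];
    return verdict;  // verdict[j] == 1 iff candidate j survives on this node
}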

5 Experiments

We evaluated the proposed algorithm in experiments conducted on the Tornado SUSU computer cluster [23], whose nodes are based on the Intel MIC accelerators [24]. Each cluster node is equipped with an Intel Xeon Phi SE10X accelerator with a peak performance of 1.076 TFLOPS (60 cores at 1.1 GHz with a hyper-threading factor of 4×). In the experiments, we investigated the scalability of our approach and compared it with analogs; the results are given below in Sections 5.1 and 5.2, respectively.

5.1 The Algorithm’s Scalability

In the first series of experiments, we assessed the algorithm's scaled speedup, which is defined as the speedup obtained when the problem size is increased linearly with the number of nodes added to the computer cluster [25]. Applied to our problem, the algorithm's scaled speedup is calculated as

$$speedup_{scaled}(n, P) = \frac{P \cdot t_{1 \cdot m}}{t_{P \cdot P \cdot m}},$$

where $n$ is the discord length, $P$ is the number of cluster nodes, $m$ is a factor of the time series length, $C_{P \cdot m}$ is the set of all candidate discords selected by the algorithm in its first phase from a time series of length $P \cdot m$, and $t_{P \cdot P \cdot m}$ is the algorithm's run time when a time series of length $P \cdot m$ is processed on $P$ nodes.

For the evaluation, we took an ECG time series [26] (see Tab. 1 for a summary of the data involved). In the experiments, we discovered discords on up to 128 cluster nodes with the time series length factor $m = 10^6$, and varied the discord length $n$, while the range parameter $r$ was chosen empirically to provide the algorithm's best performance.

The results of the experiments are depicted in Fig. 2. As can be seen, our algorithm adapts well to an increase in both the time series length and the number of cluster nodes, and demonstrates linear scaled speedup. As expected, the algorithm shows better scalability for larger values of the discord length, because this provides a higher computational load.

Table 1: Time series involved in the experiments on the algorithm's scalability

Figure 2: The scaled speedup of the algorithm

5.2 Comparison with Analogs

In the second series of experiments, we compared the performance of our algorithm against the analogs already considered in Section 3, namely DDD [15], MR-DADD [4], GPU-STAMP [19], and MP-HPC [21]. We omit the PDD algorithm [18], since in our previous experiments [5] PDD was substantially behind our parallel in-memory algorithm due to the overhead caused by message passing among the cluster nodes.

Throughout the experiments, we used synthetic time series generated according to the Random Walk model [27], since those were the ones employed for evaluation by the competitors. For comparison purposes, we used the run times reported by the authors of the respective algorithms. To perform the comparison, we ran our algorithm on Tornado SUSU with a reduced number of nodes and cores per node, to make the peak performance of our hardware platform approximately equal to that of the system on which the corresponding competitor was evaluated.

Tab. 2 summarizes the performance of the proposed algorithm compared with the analogs. We can see that our algorithm outperforms its competitors. As expected, the direct analogs DDD and MR-DADD are inferior to our algorithm, since they do not employ parallelism within a single cluster node. The indirect analogs GPU-STAMP and MP-HPC are also behind our algorithm, since they aim to solve the computationally more complex problem of computing the matrix profile, which can be used for discords discovery among many other time series mining problems.

Table 2: Comparison of the proposed algorithm with analogs

6 Conclusion

In this article, we addressed the problem of discovering anomalous subsequences in a very long time series. Currently, there is a wide spectrum of real-world applications where it is typical to deal with multi-terabyte time series that cannot fit in the main memory: medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, and others. In this study, we employ the discord concept; a discord is a subsequence of the time series that has the largest distance to its nearest non-self match neighbor subsequence.

We proposed a novel parallel algorithm for discords discovery in very long time series on a modern high-performance cluster with nodes based on many-core accelerators. Our algorithm utilizes the serial disk-aware algorithm by Yankov, Keogh et al. [4] as a basis. We achieve parallelization among the cluster nodes as well as within a single node. At the level of the cluster nodes, we modified the original parallelization scheme, which allowed us to reduce the number of candidate discords to be processed. Within a single cluster node, we proposed a set of matrix data structures to store and index the subsequences of a time series, and to provide an efficient vectorization of computations on the many-core accelerator.

The experimental evaluation on a real computer cluster with real and synthetic time series showed the high scalability of the proposed algorithm: it demonstrated linear scaled speedup, which increases under a higher computational load due to a greater discord length. Also, the algorithm's performance was ahead of the analogs, which do not combine a computer cluster with many-core accelerators.

In further studies, we plan to develop versions of the algorithm for computer clusters with GPU-based nodes.

Funding Statement: This work was financially supported by the Russian Foundation for Basic Research (Grant No. 20-07-00140) and by the Ministry of Science and Higher Education of the Russian Federation (Government Order FENU-2020-0022).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.