Rui Wang, Yao Zhou, Guangchun Luo, Peng Chen and Dezhong Peng
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
2 School of Computer and Software Engineering, Xihua University, Chengdu, 610039, China
3 School of Computer Science, Sichuan University, Chengdu, 610065, China
4 National Innovation Center for UHD Video Technology, Chengdu, 610095, China
ABSTRACT Time series anomaly detection is crucial in various industrial applications to identify unusual behaviors within the time series data. Due to the challenges associated with annotating anomaly events, time series reconstruction has become a prevalent approach for unsupervised anomaly detection. However, effectively learning representations and achieving accurate detection results remain challenging due to the intricate temporal patterns and dependencies in real-world time series. In this paper, we propose a cross-dimension attentive feature fusion network for time series anomaly detection, referred to as CAFFN. Specifically, a series and feature mixing block is introduced to learn representations in 1D space. Additionally, a fast Fourier transform is employed to convert the time series into 2D space, providing the capability for 2D feature extraction. Finally, a cross-dimension attentive feature fusion mechanism is designed that adaptively integrates features across different dimensions for anomaly detection. Experimental results on real-world time series datasets demonstrate that CAFFN performs better than other competing methods in time series anomaly detection.
KEYWORDS Time series anomaly detection; unsupervised feature learning; feature fusion
Anomaly detection aims to find data points that significantly deviate from other samples in the same data group [1], and has been widely studied in diverse research areas and application domains [2,3]. Time series data tracks samples over time in temporal order and is typically collected using field-deployed sensors that monitor the status of systems or services in the manufacturing industry [4]. Detecting anomalies in time series is a crucial task for monitoring various statuses and assisting failure troubleshooting [5], thus preventing system failure and reducing system maintenance costs [6]. In the unsupervised setting, anomalous events in time series are expected to be detected without annotation. Given the advances in sensing technology, collecting time series data has become easier and faster in various fields [7]; thus, there is an urgent need to develop effective methods that can precisely detect anomalies in time series.
Given the intricate dependencies among elements in time series data, many existing models struggle to capture complex relationships, leading to suboptimal detection performance. Many efforts have been made to close this gap. In particular, machine learning methods [8,9] have been investigated in time series anomaly detection. In [10], a fast Fourier transform is utilized to extract features, and a Bacterial Foraging Algorithm (BFA)-Gaussian support vector classifier machine (GSVCM) model is introduced for analyzing electromyography (EMG) signals. In [11], spectrogram features, including short-time Fourier transform (STFT) and continuous wavelet transform (CWT), are exploited. Subsequently, a time-frequency approach called modified S-transform is introduced, which studies the phase coupling between two or more different spatially recorded entities with non-stationary characteristics. Also, K-Nearest Neighbor (KNN) [12] and Random Forest (RF) [13] are explored for time series anomaly detection using Dynamic Time Warping (DTW) or Principal Component Analysis (PCA) pre-processed features. Although these methods have made considerable progress in time series anomaly detection, the need for domain knowledge in feature extraction generally limits their capability to capture dependencies and complex patterns in time series data, thereby impeding anomaly detection performance.
Deep learning methods have demonstrated significant success in various domains [14,15], including computer vision [16], speech recognition [17] and natural language processing [18]. This trend has garnered considerable attention in the field of time series analysis [19]. In [20], a Convolutional Neural Network (CNN) combined with a multiclass SVM is adopted for early anomaly diagnosis problems, showing superior performance compared to conventional SVM, KNN and traditional CNN. In another study [21], CNN and Recurrent Neural Network (RNN) are integrated to detect anomalies in Internet of Things (IoT) [22] time series, in which spatial and temporal features are extracted by CNN and recurrent autoencoder, respectively. In [23], a multi-head CNN-RNN model is assessed against a real industrial case study, processing each sensor with independent convolutions and requiring no pre-processing. Additionally, Generative Adversarial Networks (GANs) have been explored for time series anomaly detection in an unsupervised manner [24], where 1D CNN and Gated Recurrent Unit (GRU) are adopted as generator and discriminator, respectively. After training, the reconstruction error can serve as an informative indicator for determining whether a certain sample of the time series is anomalous. Furthermore, Variational AutoEncoders (VAEs) [25] have been adopted for learning lower-dimensional latent representations of video object trajectories, and then anomaly identification is achieved based on reconstruction loss. Given the advantage of automatic feature learning, deep learning models have made noticeable progress in time series anomaly detection and considerably improved the performance. However, because of the locality property of convolution and the sequential computation paradigm of recurrent models [26], capturing long-term dependencies remains challenging, presenting an obstacle to further improving the performance.
Recently, Transformers utilizing the attention mechanism [27] have gained widespread use for modeling sequential data [17,18], achieving impressive results and outperforming RNNs and CNNs in natural language processing and computer vision tasks [18,28]. The attention mechanism allows the model to focus on the essential parts of a sequence while ignoring the irrelevant segments, independent of the sequence length, thus providing superior performance in modeling long sequences. In time series analysis, Transformer-based models have also demonstrated their effectiveness in capturing the long-term temporal dependencies among time points [29–31]. However, the multi-head self-attention mechanism, which is the core component of Transformers, is permutation-invariant to some extent. Since time series analysis is inherently sensitive to the order of a continuous set of points, self-attention inevitably suffers from temporal information loss, resulting in inferior performance [32]. In this paper, we propose a cross-dimension attentive feature fusion network called CAFFN. This network automatically extracts features from raw data in multiple dimensions, each possessing different levels of locality properties. These features are then adaptively fused for time series anomaly detection. This design enables the learning of local and global temporal representations in 1D and 2D spaces, allowing for more effective capture of the complex patterns and dependencies in time series. The contributions of this paper are summarized as follows:
• A cross-dimension attentive feature fusion network model is proposed, which learns and fuses time series features from multiple dimensions with different levels of locality properties.
• A mixing strategy is introduced to model the dependencies at both series and feature levels, which can effectively capture the correlations in time series data.
• Evaluation on benchmark datasets shows that the proposed CAFFN achieves superior performance compared to other competing time series anomaly detection methods.
Anomaly detection refers to the identification of patterns in data that do not conform to the expected behavior, a challenge that has been actively explored for several decades [33]. Due to its broad applicability in diverse domains such as financial surveillance, risk management, health and medical risk, and AI safety, anomaly detection plays an increasingly crucial role in real-world scenarios. Various detectors have been investigated, which have made considerable progress in anomaly detection tasks. Supervised models assume the availability of a training dataset with labeled instances for both normal and anomaly classes [34]. However, obtaining accurate and representative labels can be challenging. As a trade-off solution, semi-supervised anomaly detection assumes that the training data has labeled instances only for the normal class, thus obviating the need for labels for the anomaly class [34]. In an extreme case, unsupervised anomaly detection [35,36] does not require any labeled data, with the assumption that normal instances are far more frequent than anomalies in the test data [37]. Consequently, a variety of machine learning and deep learning methods have been developed for anomaly detection.
Machine learning has been widely investigated for time series analysis. For instance, the autocorrelation function and spectrum of the stationary process were explored to learn temporal dynamics [38], and exponential smoothing [39] was used with Fourier functions of time to model seasonality in time series. Also, exponentially weighted moving averages were adopted for sales data analysis [40]. Considering the irregular properties of time series, the Kalman filter was introduced to deal with situations in which the observations were irregularly spaced [41]. In addition, Gaussian processes and their extension, deep Gaussian processes [42], have been employed for time-series prediction [43], providing a probabilistic means to model the temporal patterns in sequential data. Support vector regression [44,45] has demonstrated superiority over other nonlinear techniques, such as multi-layer perceptrons, especially when dealing with time series data sampled from nonlinear and non-stationary system processes. Furthermore, K-Nearest Neighbor (KNN) [12] and Random Forest (RF) [13] have also been assessed for time series anomaly detection tasks. Although machine learning methods have achieved considerable success in this field, the performance of machine learning pipelines heavily relies on domain knowledge and handcrafted feature design, such as short-time Fourier transform and continuous wavelet transform [10,11]. This inevitably leads to information loss, hampering detection performance.
By stacking multiple layers and imposing connection restrictions, deep neural networks [46–48] have shown remarkable potential in learning nonlinear mapping and features from raw time series data without any prior domain knowledge. In [49], Hierarchical Temporal Memory (HTM) was employed for anomaly detection in streaming applications, where an online processing paradigm was presented for handling streaming data from sensors. In [50], the Restricted Boltzmann Machine (RBM) was utilized to learn system-wide patterns in distributed cyber-physical systems in a data-driven fashion. It demonstrated its capability to capture multiple nominal modes with one energy-based probabilistic graphical model. To capture the complex temporal dependence and stochasticity of multivariate time series, gated recurrent unit and variational autoencoder were introduced, and the reconstruction probabilities based on the learned representations were used to determine anomalies in an unsupervised fashion [51]. In [52], a temporal hierarchical one-class network was proposed. It utilizes a dilated recurrent neural network with multi-resolution recurrent skip connections to extract multi-scale features; the difference between fused features and hypersphere centers is exploited for end-to-end training and determining anomaly scores for unseen time series data. However, the sequential computation paradigm of recurrent models is prone to gradient-vanishing and error accumulation problems for long sequences, and also struggles to capture global representations. Consequently, efforts have shifted towards developing Transformer-based models for time series anomaly detection [53]. Meanwhile, pure MLP architectures have shown promising performance compared to Transformer models on vision tasks [54]. However, their effectiveness in time series anomaly detection tasks remains to be explored. Generative adversarial networks have also been proposed [55] to model the distribution of time series data. However, training GANs is usually unstable and prone to mode collapse issues. Recent investigations have also revealed that CNNs are promising for capturing time series features in 2D space [26], and linear models surprisingly remain competitive in time series analysis tasks [32]. This has inspired us to explore effective network architectures that can learn features across different dimensions for time series anomaly detection.
The main framework of the proposed method is shown in Fig. 1. The anomaly score is computed as the difference between the reconstructed data and the input time series data, as indicated by the direct link from the input layer to the one after the FC layer. The main assumption is that the trained model can sufficiently learn the representation of normal time series, which predominates in the dataset, while information about anomalies is lost during training due to the lack of samples. Since the time series is naturally in 1D form, its features in 1D space are crucial for discovering specific anomalous temporal patterns. Although self-attention has been actively adopted for modeling temporal relations in an ordered set of continuous points, recent findings indicate that this mechanism can be inferior even to linear models in time series forecasting tasks [32]. In this study, we propose to use mixing as an alternative strategy to learn the 1D feature from time series data. This allows communication between different time steps at the series data level, as well as communication between different channels at the feature level. In the right part of Fig. 1, different colors in the feature block after the first layer norm indicate features at different time steps. The color changes when features at different time steps are fused by the series mixing module or features at different dimensions are fused by the feature mixing module. Besides, it is widely recognized that real-world time series typically exhibit inherent periodicity, such as daily and yearly weather observations and weekly and quarterly records for electricity consumption. This complex periodicity makes it challenging to model the temporal feature in 1D space due to the complicated interaction between multiple periods. Therefore, we further incorporate features from 2D space to enhance the representation learning capability for the time series anomaly detection task. To effectively utilize these features for detecting anomalous events in time series, a cross-dimensional feature fusion strategy is designed, which is elaborated in the following subsections.
Figure 1: Main framework of the proposed method (Tr means transpose)
To extract time series features from 1D space, the mixing block is adopted, as indicated in Fig. 1. The series mixing process involves layer normalization, a transpose operation, an autoencoder-like MLP for communication, skip connections, and a final transpose operation to maintain consistent feature sizes. Formally, for a given input time series $X \in \mathbb{R}^{T \times D}$, the mixing branch begins with layer normalization, followed by a transpose operation to exchange the feature and time dimensions. This operation results in $\mathrm{Tr}(\mathrm{LayerNorm}(X)) \in \mathbb{R}^{D \times T}$. For series mixing, an autoencoder-like multi-layer perceptron (MLP) is employed to communicate information across different time steps. Since the input and output sizes of the series mixing module are set to be the same, the first MLP output is of shape $\mathbb{R}^{D \times T}$. A skip connection is introduced to alleviate the gradient vanishing issue and model training difficulty. However, directly adding the skip connection poses a challenge due to the mismatch in size between the mixed feature from the first MLP output and the input $X$. This paper applies another transpose operation, converting the feature shape back to $\mathbb{R}^{T \times D}$. This series mixing process can be described as follows:
where $\sigma$ is the GELU [56] element-wise nonlinearity and the $W_s$ matrices are the MLP connection weights.
Communication between different points in a time series is analogous to the self-attention information flow mechanism, where each token in the sequence is visible to all other tokens. Therefore, this model can effectively capture long-term dependencies. Feature mixing is further introduced to model the correlation between different feature channels. First, layer normalization is applied to the output obtained from the series mixing step. Subsequently, another autoencoder-like MLP is employed to facilitate the exchange of information in the intermediate features, specifically focusing on the dimension of feature channels. The feature mixing process can be described as:
where the $W_f$ matrices are the MLP connection weights.
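The series and feature mixing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`layer_norm`, `mlp`, `mixing_block`), the tanh approximation of GELU, the omission of bias terms and learnable normalization parameters, the random weights, and the residual connection around the feature mixing step are all assumptions made for the example.

```python
import math
import random

def gelu(v):
    # tanh approximation of GELU (assumption: the exact variant is not stated)
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def transpose(m):
    return [list(row) for row in zip(*m)]

def layer_norm(m, eps=1e-5):
    # Normalise each row (one time step) over its feature channels.
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def mlp(m, w_in, w_out):
    # Autoencoder-like MLP on the last axis: size -> hidden -> size.
    hidden = [[gelu(v) for v in row] for row in matmul(m, w_in)]
    return matmul(hidden, w_out)

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_weights(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def mixing_block(x, ws_in, ws_out, wf_in, wf_out):
    # Series mixing: normalise, transpose to (D, T), mix along time,
    # transpose back to (T, D), then add the skip connection.
    mixed = mlp(transpose(layer_norm(x)), ws_in, ws_out)
    x_series = add(x, transpose(mixed))
    # Feature mixing: normalise, then mix along the channel axis.
    # The residual connection here is an assumption.
    return add(x_series, mlp(layer_norm(x_series), wf_in, wf_out))

rng = random.Random(0)
T, D, HS, HF = 8, 3, 4, 5  # toy sizes; the paper uses hidden sizes 32 and 64
x = [[rng.random() for _ in range(D)] for _ in range(T)]
out = mixing_block(x,
                   random_weights(T, HS, rng), random_weights(HS, T, rng),
                   random_weights(D, HF, rng), random_weights(HF, D, rng))
print(len(out), len(out[0]))  # the block preserves the (T, D) shape
```

The two transposes are what let a single `mlp` helper act first along the time axis and then along the channel axis while the block's input and output keep the same shape.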
Compared to the self-attention mechanism, one of the advantages of this design lies in its computational complexity. In self-attention-based Transformer models, the complexity is quadratic with respect to the number of tokens [18]. Conversely, in the series and feature mixing modules, the complexity is linear with respect to the input length. Moreover, the series and feature mixing strategy facilitates the information exchange between different feature channels. This capability can be beneficial for learning feature representation in multivariate time series, given their intrinsic correlations.
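A rough operation count makes the scaling difference concrete. The sketch below is a back-of-the-envelope estimate under simplifying assumptions: the hidden width of the series mixing MLP is held fixed as the input grows, only the dominant matrix products are counted, and the query, key and value projections of self-attention are ignored. The function names are hypothetical.

```python
def series_mixing_macs(T, D, hidden):
    # Two linear maps along time (T -> hidden -> T), applied to each of D channels.
    return 2 * T * hidden * D

def self_attention_macs(T, d_model):
    # Score matrix QK^T and weighted sum AV: each costs T * T * d_model multiply-adds.
    return 2 * T * T * d_model

for T in (100, 200, 400):
    print(T, series_mixing_macs(T, D=38, hidden=32), self_attention_macs(T, d_model=38))
```

Doubling the input length doubles the mixing cost but quadruples the attention cost, which is the linear versus quadratic behavior stated above.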
Although learning features in 1D space is a straightforward option for time series anomaly detection, anomalous patterns in real-world time series are often too complex to be adequately captured in their natural form. Given the widespread presence of complex periodicity in time series data, converting it to 2D space can be beneficial for handling this periodic complexity. Motivated by this consideration, the Fast Fourier Transform (FFT) is employed to extract the frequency information of a given time series. This process helps discover the periodicity, allowing the conversion of the 1D time series into 2D space. The main idea of temporal 2D feature extraction is depicted in Fig. 2. Formally, given a 1D time series $X \in \mathbb{R}^{T \times D}$,
Figure 2: Conversion of time series from 1D to 2D using FFT
Here, FFT(·) denotes the FFT function, and Amplitude(·) represents the calculation of amplitude values. $\mathrm{Freq}_{\max}$ represents the frequency with the largest amplitude. The period of the time series can be directly obtained as the series length divided by $\mathrm{Freq}_{\max}$. Based on the frequency and the period length, the 1D time series can be converted to 2D space by splitting and stacking. Specifically, we split the 1D time series into shorter segments whose length equals the discovered period, and then stack them in a newly created dimension. For a 1D time series with length $N$ and discovered period $P$, the converted 2D form has a shape $P \times (N/P)$. This process can be summarized as:
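As an illustration of the period discovery and reshaping described above, the following sketch handles a single univariate series with the standard library only. It is not the authors' code: it uses a naive discrete Fourier transform in place of an FFT routine, skips the zero-frequency component, and zero-pads the series to a multiple of the period. Those choices and the function names are assumptions.

```python
import cmath
import math

def amplitude_spectrum(series):
    # Naive DFT amplitudes for frequencies 0 .. n // 2 (an FFT gives the same values).
    n = len(series)
    amps = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(series))
        amps.append(abs(coeff))
    return amps

def dominant_period(series):
    amps = amplitude_spectrum(series)
    amps[0] = 0.0  # assumption: ignore the constant (zero-frequency) component
    freq_max = max(range(len(amps)), key=amps.__getitem__)
    if freq_max == 0:
        return 0, len(series)  # no periodicity found: treat the series as one period
    return freq_max, len(series) // freq_max

def to_2d(series, period):
    # Zero-pad to a multiple of the period (assumption), split into segments of
    # length `period`, and stack them so the result has shape period x n_segments.
    pad = (-len(series)) % period
    padded = list(series) + [0.0] * pad
    segments = [padded[i:i + period] for i in range(0, len(padded), period)]
    return [list(col) for col in zip(*segments)]

series = [math.sin(2 * math.pi * t / 12) for t in range(96)]  # 8 cycles of period 12
freq_max, period = dominant_period(series)
grid = to_2d(series, period)
print(freq_max, period, len(grid), len(grid[0]))  # 8 12 12 8
```

In the resulting grid, moving along a row compares the same phase across successive periods, while moving down a column follows the series within one period, which is the structure a 2D convolution can then exploit.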
For the 1D representations, we utilize the series and feature mixing strategy to extract abstract high-level patterns. For 2D representations, we employ stacked separable convolutions [57] to learn features. This approach facilitates communication between different periods, making it less vulnerable to complex periodicity properties. After the convolution, the features are flattened into 1D representations, formulated as:
As features in different dimensions may possess varying levels of representation ability, we introduce an attentive mechanism for feature fusion, depicted in Fig. 3. Three fully connected layers and the sigmoid function are employed for each feature to generate the attention score. These scores are then used to weight the input feature through element-wise multiplication. To break the symmetric structure, the score of one branch is further multiplied by negative one and then increased by one. This operation still outputs a value in the range from zero to one, making it compatible with the sigmoid function. This attentive feature fusion process can be formally denoted as:
where $l$ indicates the block number, and $F$, $\sigma$ and $\otimes$ represent fully connected layers, the sigmoid function and element-wise multiplication, respectively.
Figure 3: Feature fusion scheme in the proposed method (FC and σ mean fully connected layer and sigmoid function, respectively)
To provide an intuitive understanding of Eq. (6), we transform it into the following equivalent form. In this form, the output of the sigmoid activation can be viewed as a gate that controls the information flow of a feature branch.
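The gating view can be illustrated with a small sketch for one pair of flattened feature vectors. Several details are assumptions because the text does not fix them: the two gated branches are combined by summation, the 2D branch is the one whose score is flipped to one minus the sigmoid output, the three fully connected layers have no bias and no activation between them, and the weights are random. The layer sizes (half, half, full) follow the experimental settings reported later in the paper.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def linear(vec, weights):
    # weights has out_dim rows and in_dim columns; bias omitted (assumption)
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def attention_scores(feature, layers):
    # Three stacked fully connected layers followed by a sigmoid.
    h = feature
    for weights in layers:
        h = linear(h, weights)
    return [sigmoid(v) for v in h]

def make_layers(n, rng):
    # n -> n/2 -> n/2 -> n
    half = n // 2
    shapes = [(half, n), (half, half), (n, half)]
    return [[[rng.gauss(0.0, 0.5) for _ in range(c)] for _ in range(r)] for r, c in shapes]

def attentive_fusion(x1d, x2d, layers_1d, layers_2d):
    gate_1d = attention_scores(x1d, layers_1d)
    # Multiply by negative one and add one: still a value between zero and one.
    gate_2d = [1.0 - s for s in attention_scores(x2d, layers_2d)]
    # Assumption: the two gated branches are summed.
    return [g1 * a + g2 * b for g1, a, g2, b in zip(gate_1d, x1d, gate_2d, x2d)]

rng = random.Random(0)
n = 6
x1d = [rng.uniform(-1.0, 1.0) for _ in range(n)]
x2d = [rng.uniform(-1.0, 1.0) for _ in range(n)]
fused = attentive_fusion(x1d, x2d, make_layers(n, rng), make_layers(n, rng))
print([round(v, 3) for v in fused])
```

Because both gates lie strictly between zero and one, each fused element is bounded in magnitude by the sum of the magnitudes of the two inputs; a gate near zero effectively closes a branch for that element.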
The fused features $X^{(l+1)}$ serve as the input to the next 1D and 2D blocks. The output of the final feature fusion block is then mapped by a fully connected layer, computed as $O^{*} = \mathrm{FullyConnected}(X^{L})$, and can be regarded as the reconstruction of the input time series $X^{0}$. The loss function can then be constructed as:
where $N$ denotes the number of time series segments.
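A reconstruction objective of this kind, together with the per-time-step anomaly score derived from it, can be sketched as follows. The exact form of the loss is not recoverable from the text here, so the squared error summed within each segment and averaged over the $N$ segments, the mean squared error across channels used as the score, and the function names are all assumptions.

```python
def reconstruction_loss(inputs, outputs):
    # inputs, outputs: N segments, each a T x D list of lists.
    # Squared error summed within each segment, averaged over the N segments.
    total = 0.0
    for x, o in zip(inputs, outputs):
        total += sum((xv - ov) ** 2
                     for x_row, o_row in zip(x, o)
                     for xv, ov in zip(x_row, o_row))
    return total / len(inputs)

def anomaly_scores(x, o):
    # One score per time step: mean squared reconstruction error across channels.
    return [sum((xv - ov) ** 2 for xv, ov in zip(x_row, o_row)) / len(x_row)
            for x_row, o_row in zip(x, o)]

segment = [[1.0, 2.0], [3.0, 4.0]]
reconstruction = [[1.0, 2.0], [3.0, 6.0]]
print(reconstruction_loss([segment], [reconstruction]))  # 4.0
print(anomaly_scores(segment, reconstruction))           # [0.0, 2.0]
```

Time steps that the model reconstructs poorly receive a high score, which is what allows the reconstruction error to act as the anomaly criterion.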
Algorithm 1 delineates the detailed steps for training the CAFFN model. We initialize the CAFFN model as depicted in Fig. 1 and optimize its parameters by minimizing the loss function defined in Eq. (8).
We assess the performance of CAFFN on three widely used anomaly detection benchmarks obtained from real-world applications. (1) SMD (Server Machine Dataset) [51] is a 5-week-long dataset with 38 dimensions collected from an Internet company. (2) MSL (Mars Science Laboratory rover) [58] has 55 dimensions and was collected by NASA. (3) SMAP (Soil Moisture Active Passive satellite) [58] is a public dataset from NASA with 25 dimensions. Following the setting in previous studies, each dataset is split into consecutive non-overlapping segments in a pre-processing step. Abnormalities in a segment are considered detected if a single abnormal time point in that segment is identified. More details of the datasets can be found in Table 1. As a commonly adopted metric for unsupervised point-wise representation learning scenarios, the reconstruction error is considered a natural anomaly criterion in experiments. Additionally, various criteria, including Precision, Recall, and F1-score metrics, are adopted to evaluate performance comprehensively.
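The segment-level detection rule and the three metrics can be sketched as below. The sketch interprets "segment" as a contiguous run of ground-truth anomalous time points, which is the usual point-adjustment protocol in this line of work; that interpretation, the function names and the toy labels are assumptions.

```python
def adjust_predictions(pred, truth):
    # If any point inside a contiguous ground-truth anomaly is flagged,
    # count the whole anomalous run as detected.
    adjusted = list(pred)
    i = 0
    while i < len(truth):
        if truth[i] == 1:
            j = i
            while j < len(truth) and truth[j] == 1:
                j += 1
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def precision_recall_f1(pred, truth):
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

truth = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
adjusted = adjust_predictions(pred, truth)
print(adjusted)  # [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(precision_recall_f1(adjusted, truth))
```

In this toy case the first anomalous run is credited in full because one of its points was flagged, while the second run is missed entirely, giving a precision of 0.75 and a recall of 0.6.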
Table 1: Dataset details
To ensure a fair comparison, we adhere to the settings in previous studies [52]. The non-overlapping window size is set to 100 for all datasets, and a time point is labeled as an anomaly if the anomaly score is larger than a threshold determined by the statistics of the training set. We empirically found the optimal architectural setting based on grid search and GPU memory constraints. The setting that achieved the best result on the validation set is selected for experimental comparison on the test set. Specifically, the feature blocks are stacked three times for all datasets ($L = 3$). The sizes of hidden layers in series mixing and feature mixing are set to 32 and 64, respectively. Regarding the CAFFN model, the first FC layer in the attentive fusion module maps the input to a feature with half of the input's length. The second FC layer does not change the dimensionality, and the third FC layer maps the feature back to the space whose dimensionality is the same as the module input. Adam, with default settings, is used for parameter optimization with a batch size of 128, and the training process is stopped within 10 epochs. All experiments are implemented using PyTorch and run on a computer equipped with an NVIDIA RTX 3090 GPU.
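The thresholding step can be illustrated with a short sketch. The text only states that the threshold is determined by statistics of the training set; the nearest-rank percentile rule used here, the `percentile` parameter and the function names are assumptions chosen for illustration.

```python
import math

def threshold_from_training(train_scores, percentile=99):
    # Nearest-rank percentile of the anomaly scores observed on the training set.
    ordered = sorted(train_scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

def label_points(scores, threshold):
    # A time point is labeled anomalous when its score exceeds the threshold.
    return [1 if s > threshold else 0 for s in scores]

train_scores = list(range(1, 101))  # toy scores 1 .. 100
threshold = threshold_from_training(train_scores, percentile=95)
print(threshold)                                # 95
print(label_points([10, 95, 96], threshold))    # [0, 0, 1]
```

A higher percentile makes the detector more conservative, trading recall for precision.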
The performance of the proposed CAFFN model on time series anomaly detection datasets is shown in Table 2, together with a comparison against closely related competitive methods. In particular, MLP-based [32,59], RNN-based [60], CNN-based [26,61], and many Transformer-based time series anomaly detection methods are considered. As seen, the F1-scores of the proposed CAFFN model on SMD, MSL, and SMAP are 85.81%, 85.48% and 71.52%, respectively. This indicates that the proposed CAFFN can outperform RNN-based methods like LSTM by a large margin, primarily attributed to its capability to capture long-term dependencies. Moreover, the performance of the proposed CAFFN model is superior to many other Transformer-based methods, showcasing its strength in modeling the complex features of time series. This aligns with previous studies that have surprisingly found that even simple linear models can outperform Transformer-based models [32]. The only slightly worse performance achieved by TimesNet [26] indicates that capturing features in 2D space can provide strong results. The proposed CAFFN model learns features in both 1D and 2D spaces, which could be the main reason for its superior performance. Compared to existing deep learning-based time series anomaly detection methods, the proposed CAFFN model employs a well-designed feature extraction block, which provides a stronger capability to capture the spatial and temporal features of time series data.
Table 2: Quantitative results for CAFFN (Proposed) on three real-world datasets. P, R, and F1 represent the Precision, Recall and F1-score, respectively. For a fair comparison, reconstruction error is adopted as the anomaly criterion for all the compared methods
The reconstruction errors on the SMD training set and validation set during model training are recorded and presented in Fig. 4. It can be observed that the error decreases on both the training and validation sets, validating the capability of CAFFN in modeling time series data.
Figure 4: Reconstruction errors on SMD dataset
The detection results for some test segments in MSL are illustrated in Figs. 5 and 6. It is evident that the anomaly score significantly increases when the segments of the time series contain anomalous events, indicating that abnormal patterns are reconstructed with a substantial error. Although the time steps of annotated anomalies and predicted ones are not precisely aligned, common practice usually allows for the detection of anomalies in a reasonably wide window. Therefore, for most of the time steps shown in the figures, the detection results serve as an accurate indicator for localizing anomalies.
Figure 5: Anomaly score and ground truth on MSL test set from time step 45985 to 46125 (values of ground truth are adaptively scaled for better visualization)
Figure 6: Anomaly score and ground truth on MSL test set from time step 55550 to 55830 (values of ground truth are adaptively scaled for better visualization)
In this subsection, we evaluate the effectiveness of each component in the proposed CAFFN model. Firstly, different settings of the mixing block are investigated by disabling either the series mixing block or the feature mixing block. Specifically, we can remove the feature mixing part from the mixing 1D block without affecting the output shape. Additionally, by directly feeding the input to the feature mixing module and removing the series mixing module, we can also obtain a valid mixing 1D block, as shown in Fig. 1. The results are shown in Table 3.
Table 3: Performance comparison using different series mixing and feature mixing settings
Additionally, the feature branches in the proposed CAFFN model are investigated similarly. We first disable the branch that uses a 2D block while keeping the branch that uses a 1D block. In this setting, the proposed model lacks the capability of capturing 2D features, and cross-dimension attentive feature fusion is not needed. The performance of this configuration is shown in the first row of Table 4, where a considerable performance degradation can be observed. Next, we enable the branch that uses a 2D block but disable the branch that uses a 1D block. In this case, the model is incapable of learning 1D features, and the results are shown in the second row of Table 4. It can be seen that the performance is slightly better than in the previous situation, verifying the advantage of employing FFT for discovering 2D structures in time series. The performance further improves when both branches are enabled, as indicated by the final row of Table 4, where features from 1D and 2D spaces are fused for anomaly detection. Therefore, the two-branch structure is necessary for obtaining promising performance.
Table 4: Performance comparison using different feature branch settings
To validate the effectiveness of the attentive feature fusion mechanism in the proposed method, we compare it to alternative feature fusion schemes, including multiplication, concatenation, and addition. The results of the different feature fusion strategies are shown in Table 5. It can be observed that all three alternatives perform slightly worse than the proposed attentive mechanism, verifying the merits of the attentive fusion design.
Table 5: Performance comparison using different feature fusion strategies (Mul, Cat and Add represent multiplication, concatenation and addition, respectively)
We further investigate the impact of different segment sizes on the performance, and the results are shown in Table 6. The results indicate that the segment size has only a slight influence on the performance, and that setting the segment size to 100 achieves promising results.
Table 6: Performance comparison using different segment sizes
We also investigated the parameter sensitivity of the proposed CAFFN model on the SMD dataset. The results are presented in Table 7. It is evident that the proposed model yields favorable outcomes when configured with a three-layer structure, along with series and feature mixing blocks set to dimensionalities of 32 and 64, respectively.
Table 7: Parameter sensitivity study on SMD dataset (#Layer means the number of layers, and S/F represents the dimensionalities of the series/feature mixing modules, respectively)
This study proposes a cross-dimension attentive feature fusion network for time series anomaly detection. In this reconstruction-based method, we introduced a series and feature mixing block to learn representations in 1D space. Additionally, we adopted a fast Fourier transform to convert the time series into 2D space for learning 2D representations. Furthermore, a cross-dimension attentive feature fusion mechanism was designed to effectively utilize the 1D and 2D features, adaptively integrating features across different dimensions for anomaly detection. Experiments on real-world time series datasets demonstrated that CAFFN outperforms other competing baselines. Moreover, the ablation study confirmed the effectiveness of the feature learning module and the feature fusion mechanism. Future investigation directions include exploring signal processing techniques and generative models for data and feature augmentation.
Acknowledgement: The authors wish to express their appreciation to the reviewers for their helpful suggestions, which greatly improved the presentation of this paper.
Funding Statement: This work was supported in part by the National Natural Science Foundation of China (Grants 62376172, 62006163, 62376043), in part by the National Postdoctoral Program for Innovative Talents (Grant BX20200226) and in part by Sichuan Science and Technology Planning Project (Grants 2022YFSY0047, 2022YFQ0014, 2023ZYD0143, 2022YFH0021, 2023YFQ0020, 24QYCX0354, 24NSFTD0025).
Author Contributions: The authors confirm contribution to the paper as follows: study conception and design: Rui Wang, Yao Zhou, Dezhong Peng; data collection: Peng Chen; analysis and interpretation of results: Guangchun Luo; draft manuscript preparation: Rui Wang. All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials: The data used in this article are freely available in the mentioned references.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
Computer Modeling in Engineering & Sciences, 2024, Issue 6