Feilu Hang, Wei Guo, Hexiong Chen, Linjiang Xie, Chenghao Zhou and Yao Liu
1 Information Center, Yunnan Power Grid Company Limited, Kunming 650034, China
2 Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu 610054, China
ABSTRACT Modern large-scale enterprise systems produce large volumes of logs that record detailed runtime status and key events at critical points. These logs are valuable for analyzing performance issues and understanding the status of the system. Anomaly detection plays an important role in service management and system maintenance, and guarantees the reliability and security of online systems. Logs are universal semi-structured data, which causes difficulties for traditional manual detection and pattern-matching algorithms. While some deep learning algorithms utilize neural networks to detect anomalies, these approaches rely heavily on manually designed features, so the effectiveness of anomaly detection depends on the quality of those features. At the same time, these methods ignore the underlying contextual information present in adjacent log entries. We propose a novel model called Logformer with two cascaded transformer-based heads to capture latent contextual information from adjacent log entries, and leverage pre-trained embeddings based on logs to improve the representation of the embedding space. The proposed model achieves comparable results on the HDFS and BGL datasets in terms of precision, recall, and F1-score. Moreover, the consistent rise in F1-score shows that the representation of the embedding space with pre-trained embeddings is closer to the semantic information of the logs.
KEYWORDS Anomaly detection; system logs; semi-structured data; pre-trained embedding; cascaded transformer
With the development of the Internet and the ever-increasing number of Internet users, online systems keep growing in functionality and scale. Anomalous events can occur at any time, which places high demands on quality of service and security guarantees [1-4]. More services mean more sources of anomalous events, and even a small anomalous event can affect the entire system. If abnormal events are not detected and removed in a timely manner, the stability and availability of the whole system suffer, and failure to ensure system stability leads to customer distrust and financial losses. Logs record software status and system information at critical points, providing a rich source of data for system monitoring and anomaly detection. In other words, anomaly detection helps to identify anomalous events and safeguard the stable operation of the system.
An online system typically produces large-scale semi-structured logs, so it is impractical to detect and localize anomalies with traditional rule matching and manual screening. As an alternative, many machine learning algorithms have been proposed to extract features from semi-structured logs. PCA [5] is used to extract the principal components of logs to improve classification models, and LR [6] and SVM [7] classify logs based on manually designed features. Nevertheless, these approaches neglect long-range dependencies in logs. Later, deep learning algorithms were proposed to learn representations for logs automatically, without manually designed features. However, most of them fail to extract the potential contextual information present in adjacent log entries, making them uncompetitive in terms of performance. RNN-based models have a natural advantage in capturing long-range dependencies, and LSTMs [8,9] have been proposed to learn extractive representations for logs. Given that randomly initialized LSTMs are difficult to optimize and lack parallel computational capacity due to their recurrent structure, self-attention models have been proposed to replace LSTMs with efficient parallel structures. Transformer-based models [10] capture long-range dependencies in an efficient parallel fashion, but these approaches still neglect the potential contextual information present in adjacent log entries.
To address the above issues, we propose a novel model called Logformer, which consists of two cascaded transformer architectures with an encoder head and a decoder head. The encoder head encodes adjacent log entries into vector representations, and the decoder head learns latent context information from adjacent log entries. To preserve the useful features of logs, we propose a log preprocessing method to replace the regular log parser. In the meantime, we adopt the Glove algorithm to train embeddings for logs, thus making the embedding space more closely related to the log semantics. Experimental results on the HDFS dataset and the BGL dataset demonstrate that Logformer outperforms other approaches.
As an important part of large-scale systems, logs have been of great concern for many years, and many researchers have attempted to use log data to understand the runtime status of a system. However, system logs are usually semi-structured data, which is difficult to handle. Early approaches used keywords to locate systematic errors. Such methods can only detect an explicit single anomalous log entry and cannot detect an anomalous event based on a sequence of operations; in other words, an anomalous event in the system log cannot be detected by manually designed keywords. To address these issues, matching methods [11,12] have been proposed for anomaly detection. These methods rely heavily on rules defined manually in advance and are unable to detect anomalous events from new sources.
With the development of deep learning, deep learning approaches are being applied in the field of anomaly detection. Most of them consist of three steps [13]: first, logs are converted into templates by a log parser; then, the templates are fed into a neural network to learn vector representations for logs; finally, traditional machine learning algorithms such as SVM classify a log entry as normal or abnormal based on the learned vector representation. Because anomalous log entries make up only a small fraction of the data, it is often difficult to extract features from anomalous data, so many researchers use unsupervised approaches [14,15] for anomaly detection.
For preprocessing semi-structured log entries, many methods use parsers; common ones include Drain [16], AEL [17], IPLoM [18], and Spell [19]. Other approaches obtain semantic vectors directly from embeddings without using a parser. LogNL [8] utilizes the TF-IDF algorithm to obtain template feature representations and then constructs parameter value vectors for logs of different templates. Several works train embeddings for logs with the Word2Vec and SIF algorithms.
The attention mechanism can better capture long-term dependencies in the data and improve the model's ability to extract the most relevant input features. In recent years, attention has been successfully applied to image processing, natural language processing, recommendation systems, and other fields. Xu et al. [20] proposed a model based on attention and AutoEncoder. With attention, the decoder can adaptively select the desired features in the input sequence, which improves the performance of the Encoder-Decoder structure.
Recently, deep learning models [21-26] have achieved remarkable results in the field of anomaly detection. These approaches adopt RNN-based, LSTM-based, or transformer-based models as backbones to extract representations from log entries and predict normal or abnormal labels based on the feature representations. Most of these approaches ignore the underlying contextual information present in adjacent log entries. Different from regular deep learning models, Logformer effectively captures the long-range dependencies among adjacent log entries through a single decoder head and makes full use of the textual information of the logs to train embeddings.
In this section, we describe in detail the log preprocessing method, the pre-trained embedding algorithm, and the model architecture.
The model architecture of Logformer is a multi-layer stack of self-attention and feed-forward networks, as shown in Fig. 1. Logformer consists of an encoder head, which encodes log entries into vectors, and a decoder head, which extracts latent contextual information present in adjacent log entries. Before log entries are fed into Logformer, the preprocessing method is applied to produce structured log data. Given a sequence of adjacent log entries X = (x_1, x_2, ..., x_n), the embedding layer maps X to word embeddings E = (E_1, E_2, ..., E_n), where E_i denotes the word embedding of log entry x_i. To exploit the textual information of the logs, pre-trained embeddings trained on the logs with the Glove algorithm are used to initialize the embedding layer. To preserve the order of log entries, position embeddings are added in both the encoder and decoder heads. The encoder head outputs the representations H = (H_1, H_2, ..., H_n) for the log entries, where H_i is the representation of log entry x_i, and H is regarded as a new sequence. The decoder head takes H as input, learns the interactive information among different log entries, and outputs a new representation H'_i for each log entry enriched with context information. Finally, H' is fed into a linear classifier layer to predict whether a log entry is normal or not.
Figure 1: Model architecture: Logformer for log anomaly detection
Original logs are semi-structured data, and log parsers are widely used in most deep learning methods [22,23]. Lupton et al. [27] surveyed log parser methods, including their publication year, number of citations, and performance, as shown in Table 1, where Avg PA denotes the average parsing accuracy (PA) over all 16 datasets in Loghub, described in detail in He et al. [28]. It can be observed from Table 1 that even the three most-cited log parsers fail to achieve an Avg PA of 0.9 on these large-scale open-source datasets, which means that these parsers still introduce an unacceptable error. At the same time, log parsers discard log-level information, resulting in the loss of source features from the logs. As a result, Logformer takes a different approach to log preprocessing than log parsers. The log preprocessing of Logformer consists of two steps. First, each log entry is split by commas, spaces, and other common separators. The log is then converted to lowercase, and characters that do not contain valid information for anomaly detection are removed, such as numbers representing variables and tokens not in the template such as 'for', 'ID' and 'of'. Fig. 2 compares the original logs with the preprocessed logs.
Table 1: Effectiveness of log parsers
Table 1 (continued)
Parser    Year    Citations    BGL PA    Avg PA
LTMatch   2021    18           0.933     0.889
Paddy     2020    1            0.963     0.895
Figure 2: Original logs and preprocessed logs
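As an illustration, the following is a minimal Python sketch of the preprocessing described above; the separator set, the stop-word list, and the sample HDFS-style log line are illustrative assumptions rather than the authors' exact choices.

```python
import re

# Illustrative stop words without useful information for anomaly detection.
STOP_WORDS = {"for", "id", "of"}

def preprocess(log_entry: str) -> str:
    """Lowercase a raw log entry, split it on common separators, and drop
    numeric/variable tokens and stop words."""
    tokens = re.split(r"[,\s;:=()\[\]/]+", log_entry.lower())
    kept = [
        t for t in tokens
        if t and t not in STOP_WORDS and not any(c.isdigit() for c in t)
    ]
    return " ".join(kept)

print(preprocess("Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106"))
# -> "receiving block src"
```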
This section describes the WordPiece and Glove algorithms: WordPiece is a subword segmentation algorithm from natural language processing, and Glove is a word embedding algorithm used to pre-train embeddings on a corpus.
3.3.1 Tokenizer
Although logs are properly handled in the log preprocessing phase, the preprocessed logs still contain artificially generated words such as 'DataNode$PacketResponder'. Semantic vectors extracted directly from such words are not well understood, so further word segmentation of the extracted content is needed. We choose the WordPiece tokenizer to tokenize the preprocessed logs and output the tokenized corpus. WordPiece starts from a small initial vocabulary and iteratively adds the subword units that most increase the likelihood of the training data until the vocabulary is learned.
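As a sketch of this step, one way to train a WordPiece tokenizer on the preprocessed log corpus is with the HuggingFace `tokenizers` library; the vocabulary size, special tokens, and toy corpus below are illustrative assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a WordPiece model and train it on (toy) preprocessed log lines.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()   # split on whitespace only
trainer = trainers.WordPieceTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"])

preprocessed_logs = [
    "packetresponder terminating",
    "datanode$packetresponder received block",
    "receiving block src",
]
tokenizer.train_from_iterator(preprocessed_logs, trainer=trainer)

# Artificially generated words are broken into subword units learned from the corpus.
print(tokenizer.encode("datanode$packetresponder terminating").tokens)
```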
3.3.2 Glove
To match the embedding space with the logs, Glove [29] is adopted to train the pre-trained embeddings instead of randomly initializing the embedding layer. Whereas Word2Vec [30] is trained on words in a local sliding window, Glove is trained on the global co-occurrence statistics and therefore incorporates global information of the corpus. Let us denote the co-occurrence matrix as X ∈ R^{|V|×|V|}, where X_ij is the number of times word j occurs in the context of word i, and |V| is the vocabulary size. Denoting the embeddings of words i and j as w_i and w̃_j, respectively, X_ij can be approximated as Eq. (1).
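In the standard GloVe formulation, which Eq. (1) is assumed to follow, the log co-occurrence count is approximated by the dot product of the two word vectors plus bias terms:

$$ w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij} \tag{1} $$

where b_i and b̃_j are scalar biases for the word and the context word.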
The training objective function can be formulated as Eq.(2).
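Assuming the standard GloVe objective, Eq. (2) is a weighted least-squares loss over all co-occurring word pairs:

$$ J = \sum_{i,j=1}^{|V|} f\left(X_{ij}\right)\left( w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2} \tag{2} $$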
where f(X_ij) is a weighting function, which can be parameterized as Eq. (3).
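In the standard GloVe parameterization, which Eq. (3) is assumed to follow, the weighting function caps the influence of very frequent co-occurrences:

$$ f(x) = \begin{cases} \left( x / X_{\max} \right)^{\alpha}, & x < X_{\max} \\ 1, & \text{otherwise} \end{cases} \tag{3} $$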
where α and X_max are set to 0.75 and 1.00, respectively. α improves the accuracy of low-frequency words by scaling up their weight, while X_max limits the maximum number of occurrences that contribute to the weight.
Most existing deep learning approaches take embeddings of preprocessed logs as input and predict each log entry independently. A single log entry may appear normal, but when it occurs consecutively in adjacent log entries, it may indicate an anomaly. Therefore, exploiting the long-range dependency information among adjacent log entries is beneficial for anomaly detection. In this work, Logformer is proposed to address these issues. Logformer consists of a cascaded transformer architecture with an encoder head and a decoder head, where both heads have the same layer architecture. The encoder head contains 6 stacked layers, each consisting of two sub-layers: a multi-head self-attention layer and a feed-forward network. The decoder head contains only 1 layer, with the same two sub-layers as the encoder head. In the following, we describe the self-attention layer and the feed-forward network in detail.
3.4.1 Multi-Head Self-Attention
Logformer uses multi-head self-attention to capture long-range dependencies in adjacent log entries. The heart of the encoder and decoder heads is self-attention, which maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. Conventional self-attention contains only one head, whereas multi-head self-attention contains several heads, as shown in Fig. 3.
Figure 3: Attention mechanism
Denoting the query, key, and value as Q, K, and V ∈ R^{d_model}, respectively, self-attention can be calculated as follows:
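Assuming the standard scaled dot-product attention of the transformer, with the scaling dimension written as d_model to match the text, Eq. (4) reads:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^{\top}}{\sqrt{d_{\mathrm{model}}}} \right) V \tag{4} $$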
where d_model is the dimension of the value vectors.
Instead of performing a single attention function, Logformer captures richer contexts with multiple individual attention functions. Multi-head self-attention can be regarded as the repetition of h times of self-attention, where h is the number of attention heads. Multi-head self-attention can be calculated as Eq. (5).
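In the standard multi-head formulation, which Eq. (5) is assumed to follow, the h attention outputs are concatenated and projected:

$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right) \tag{5} $$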
where W_i^Q, W_i^K, and W_i^V ∈ R^{d_model × d_h}, and d_h = d_model/h.
3.4.2 Feed-Forward Network
The feed-forward network is used to obtain information in the channel dimension. It applies an expansion operation to x ∈ R^{d_model × l} and then recovers the intermediate output to the original dimension, which is calculated as Eq. (6).
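Assuming the standard transformer feed-forward network (two linear projections with a ReLU activation in between), Eq. (6) can be written as:

$$ \mathrm{FFN}(x) = W_2\,\max\left(0,\; W_1 x + b_1\right) + b_2 \tag{6} $$

where W_1 expands the d_model-dimensional input to the hidden width and W_2 projects it back.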
3.4.3 Positional Encoding
3.4.4 Batch Normalization
It is well known that normalizing the feature maps makes training faster and more stable. Logformer incorporates Batch Normalization (BN) in the attention layers in place of the original Layer Normalization (LN) of the transformer. Layer Normalization is commonly used in RNNs, where the sequence length is often not fixed; since LN does not depend on batch size or sequence length, it performs better in RNNs. Compared with LN, BN shows better robustness and generalization ability. BN also has the advantage of generally being faster at inference than batch-independent normalizers such as LN. BN treats the batch data as a whole: the batch dimension is used in the calculation of both the mean and the variance [31]. An important feature of Logformer is to combine contextual information from multiple adjacent log entries. We want the normalization in Logformer to be sensitive to the batch, so we choose BN instead of LN.
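To make the cascaded structure of Section 3.4 concrete, the following is a minimal PyTorch sketch of one possible implementation. It is not the authors' code: the per-entry mean pooling between the two heads, the feed-forward width, the use of 8 attention heads (so that the 128-dimensional hidden size divides evenly), and the omission of positional encodings are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class BNTransformerLayer(nn.Module):
    """One Logformer-style layer: multi-head self-attention and a feed-forward
    network, each followed by a residual connection and BatchNorm1d
    (instead of the LayerNorm used in the original transformer)."""
    def __init__(self, d_model: int = 128, n_heads: int = 8, d_ff: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.bn1 = nn.BatchNorm1d(d_model)
        self.bn2 = nn.BatchNorm1d(d_model)

    def _norm(self, bn: nn.BatchNorm1d, x: torch.Tensor) -> torch.Tensor:
        # BatchNorm1d expects (batch, channels, length), so normalize over the feature dimension.
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self._norm(self.bn1, x + attn_out)
        x = self._norm(self.bn2, x + self.ff(x))
        return x

class Logformer(nn.Module):
    """Cascaded structure: a 6-layer encoder head over the tokens of each log entry,
    a 1-layer decoder head over the resulting entry vectors, and a linear classifier
    that labels each entry as normal or abnormal."""
    def __init__(self, vocab_size: int, d_model: int = 128, n_heads: int = 8,
                 n_enc: int = 6, n_dec: int = 1, n_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # can be initialized with GloVe vectors
        self.encoder = nn.Sequential(*[BNTransformerLayer(d_model, n_heads) for _ in range(n_enc)])
        self.decoder = nn.Sequential(*[BNTransformerLayer(d_model, n_heads) for _ in range(n_dec)])
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # token_ids: (n_entries, seq_len)
        tokens = self.encoder(self.embed(token_ids))          # per-entry token representations
        entries = tokens.mean(dim=1).unsqueeze(0)             # pool each entry into one vector H_i
        context = self.decoder(entries).squeeze(0)            # context-aware representations H'_i
        return self.classifier(context)                       # (n_entries, n_classes) logits
```

Feeding a window of adjacent log entries through this sketch yields one normal/abnormal logit pair per entry, with the decoder head mixing context across the entries in the window.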
In this paper, we select the open-source HDFS and BGL datasets from LogHub [28] to validate the effectiveness of Logformer, as shown in Table 2. These two log datasets are widely used in the field of anomaly detection. The HDFS dataset is generated by benchmark workloads in a private cloud environment, and labels are produced by manually designed rules that identify abnormal events. The BGL dataset contains logs collected from the BlueGene/L supercomputer system at Lawrence Livermore National Labs (LLNL). BGL log entries are divided into alert and non-alert entries: the first column of each entry is either '-' or an alert tag, where '-' denotes a non-alert entry and any other value denotes an alert.
Table 2: Statistics of the datasets
We select precision, recall, and F1-score as the evaluation metrics. These three metrics are based on the confusion matrix. The confusion matrix has four categories: true positives (TP) are examples correctly labeled as positive; false positives (FP) refer to negative examples incorrectly labeled as positive; true negatives (TN) correspond to negative examples correctly labeled as negative; finally, false negatives (FN) refer to positive examples incorrectly labeled as negative.
Precision is the percentage of correctly predicted positive samples among all predicted positive samples. Recall is the percentage of correctly predicted positive samples among all real positive samples. The F1-score is the harmonic mean of precision and recall.
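In terms of the confusion matrix, these metrics follow the standard definitions:

$$ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$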
where TP, FP, and FN denote the true positives, false positives, and false negatives, respectively.
We construct Logformer with two cascaded transformers. The numbers of transformer layers in the encoder head and the decoder head are 6 and 1, respectively. The hidden dimension is 128 and the number of attention heads in Logformer is 6. Layer normalization is replaced with batch normalization in the transformer layers. We use AdamW to optimize all parameters. The learning rate is set to 5e-4 and the batch size is set to 32.
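Under the same assumptions as the architecture sketch in Section 3.4, a hypothetical training step with these hyperparameters might look as follows; the dummy tensors only illustrate the expected shapes, and `Logformer` refers to that sketch.

```python
import torch

model = Logformer(vocab_size=8000)                          # sketch from Section 3.4, hidden dim 128
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # AdamW with the reported learning rate
criterion = torch.nn.CrossEntropyLoss()

token_ids = torch.randint(0, 8000, (32, 64))   # a batch of 32 adjacent log entries, 64 tokens each
labels = torch.randint(0, 2, (32,))            # per-entry normal/abnormal labels (dummy data)

logits = model(token_ids)                      # (32, 2) logits, one pair per log entry
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```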
The experiments on the HDFS and BGL datasets are shown in Table 3. We compare Logformer with several existing methods on the two public datasets, including the data-driven method PCA [32], the traditional machine learning methods LR [33] and SVM [33], and the deep learning method LogRobust [34]. Owing to its limitations, PCA achieves poor results on both the HDFS and BGL datasets. It can be observed that both the conventional machine learning and the deep learning methods obtain consistently high performance on the HDFS dataset. The reason is that log entries in HDFS tend to be more structured than those in BGL, and abnormal log entries differ considerably from normal ones. On the more complicated BGL dataset, as shown in Table 3, LogRobust and Logformer outperform LR and SVM by a large margin. Notably, Logformer surpasses LogRobust by 8% in F1-score, which demonstrates that Logformer is more suitable for complicated semi-structured logs.
Table 3: Experiment results on HDFS and BGL datasets
To validate the hypothesis that pre-training embeddings on logs makes the embedding space match the textual information of logs better, we conduct comparative experiments using pre-trained embeddings and randomly initialized embeddings on the HDFS and BGL datasets. In addition, we validate the importance of extracting the context information existing in adjacent log entries by adding/removing the encoder and decoder heads. Finally, we discuss the influence of the number of log entries on anomaly detection through experiments.
As shown in Table 4, the pre-trained embeddings are superior to the randomly initialized embeddings in all metrics on both the HDFS and BGL datasets. Pre-training extracts prior knowledge of the task and improves the performance of Logformer by training the embeddings, which makes good use of the textual information of logs.
Table 4: Experimental results on the effectiveness of pre-trained embeddings
When validating the effectiveness of the encoder and decoder heads, we set up three comparative studies: w/o means predicting directly after averaging the pre-trained embeddings; w encoder means predicting from the output of the encoder head; w encoder-decoder means predicting from the output of the decoder head. It can be observed from Table 5 that the encoder head obtains better results than the pre-trained embeddings alone on both datasets, which demonstrates that the self-attention layer in the encoder head can capture the long-range dependency information existing in a log entry. After adding both the encoder and decoder heads, the results are better than with only the encoder head, which means the decoder head improves the performance of Logformer by extracting potential information from adjacent log entries.
Table 5: Experimental results on the effectiveness of the decoder head
As shown in Table 6, we experiment with different batch sizes. In Logformer, the batch size represents the number of log entries used for context combination. With increasing batch size, the detection performance of Logformer also improves slightly, which is in line with intuition. However, a larger batch size means greater resource consumption and longer running time, so the batch size can be adjusted as needed in actual use.
Table 6: Experimental results on the effectiveness of batch size
The experimental results show that Logformer with the complete cascaded structure achieves the best results on all datasets, which proves that our proposed cascaded structure is effective. The encoder first combines multiple semantic vectors to complete the encoding, and then the contextual associations between adjacent log entries are captured by the decoder. The semantic vectors containing rich context information make a significant contribution to system log anomaly detection.
This paper proposes an anomaly detection method with a cascaded structure that makes full use of the potential context information among adjacent logs. We also propose a log preprocessing method that converts semi-structured logs into structured logs by removing common punctuation and other redundant characters. To make the embedding space match the textual semantics of logs, the WordPiece algorithm is adopted to tokenize the preprocessed logs into subwords, and the Glove algorithm is used to train embeddings on the log corpus. Finally, the cascaded-structure model Logformer is designed to learn a vector representation for each log entry and extract long-range dependencies among adjacent log entries.
Logformer achieves superior performance compared to conventional machine learning algorithms and some deep learning models. The two ablation studies show that Logformer outperforms other approaches by efficiently extracting the potential information existing in adjacent log entries. However, there are still more challenges in real-world scenarios. We do not yet provide an expert system to realize a feedback mechanism, which often plays a crucial role in an online system. These problems are the direction of our future efforts.
Acknowledgement: The authors wish to express their appreciation to the reviewers for their helpful suggestions, which greatly improved the presentation of this paper.
Funding Statement: This work was supported by the National Natural Science Foundation of China (Nos. 62072074, 62076054, 62027827, 61902054, 62002047), the Frontier Science and Technology Innovation Projects of the National Key R&D Program (No. 2019QY1405), the Sichuan Science and Technology Innovation Platform and Talent Plan (No. 2020TDT00020), and the Sichuan Science and Technology Support Plan (No. 2020YFSY0010).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.