Event Normalization Through Dynamic Log Format Detection

2014-07-19 12:24
ZTE Communications, 2014, Issue 3

Amir Azodi, David Jaeger, Feng Cheng, and Christoph Meinel

(Hasso Plattner Institute, University of Potsdam, 14482 Potsdam, Germany)


The analytical and monitoring capabilities of central event repositories, such as log servers and intrusion detection systems, are limited by the amount of structured information extracted from the events they receive. Diverse networks and applications log their events in many different formats, and this makes it difficult to identify the type of logs being

event normalization; intrusion detection; event stream processing; knowledge base; security information and event management

1 Introduction

1.1 Event Sources

Events related to security and operation occur in a variety of places in a network. Generally, every computerized device can generate events that can be logged. Both security information and event management (SIEM) systems and intrusion detection (ID) systems are designed to detect unauthorized intrusions into a network. Both of these systems generally operate on the events created by network-enabled devices. Such devices can be classified as hosts or other low-level hardware. On its own, a host can have multiple sources of events at different levels of operation, i.e., at the OS level and application level [1].

The OS is a fundamental source of events on a host. The core of the OS is the OS kernel, which has the highest access privileges and can therefore provide valuable information about the system state and operations performed on the system. This information includes the hardware configuration and state; allocated system resources; security-related operations, such as authentication; and authorization for network and physical access to the host.

Applications running on top of the OS can generate an almost endless variety of events relating to many different conditions and are therefore another rich source of event information.

Hardware includes devices that are not host computers in the usual sense, i.e., they run with a static or semi-static software configuration delivered via pre-installed firmware. Devices that fall into this category are network infrastructure devices, such as routers and switches, and peripherals, such as printers or VoIP telephones. Event information from infrastructure devices is especially valuable for SIEM and ID systems because such devices, especially routers and switches, can intercept connections and control access between nodes.

1.2 Challenges in Event Normalization

When configuring hosts to forward their logs to a log server, an administrator ideally specifies the server’s connection details, e.g., the hostname, port, and communication protocol, without configuring the server for individual connections. This means that the client can forward a single event stream containing many diverse events generated at different levels of the software stack. On the receiving end, the server must separate the different event types and handle them according to their type. The main challenge in handling diverse events lies in identifying the types of events submitted for normalization. Events created by different systems are likely logged using different formats; therefore, there are differences between events that can make normalization difficult.

Each log format has its own limitations. For instance, the Windows event log favors software debugging, whereas the Common Event Expression (CEE) standard allows limitless arbitrary fields to be added. However, CEE does not provide a standardized method of logging even the most basic events related to debugging.

Some log formats, such as Syslog or Snort, are partly or completely unstructured. Unstructured information is a challenge for event normalization because it cannot be directly mapped to another format if the ontologies of the two formats do not match or overlap for the fields in question. An example is the message (MSG) field in Syslog.
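As a hedged illustration of this split between structured and unstructured parts, the following Python sketch parses the header of a BSD-style Syslog line with a named-group regular expression. The sample line is the example message given in RFC 3164; the pattern and the group names (`pri`, `timestamp`, `host`, `msg`) are assumptions made for illustration and are not taken from the system described in this paper. Everything after the host name ends up in one free-text `msg` group, which is exactly the part that cannot be mapped directly.

```python
import re

# Assumed, simplified pattern for an RFC 3164 style line:
# <PRI>Mmm dd hh:mm:ss HOST MSG
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<msg>.*)$"
)

line = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8"

match = SYSLOG_RE.match(line)
fields = match.groupdict()

# The PRI value encodes facility * 8 + severity.
facility, severity = divmod(int(fields["pri"]), 8)

print(facility, severity)   # structured header values
print(fields["msg"])        # unstructured free text
```

The header yields typed values, while the content of `msg` still needs format-specific handling before it can be normalized.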

Even when information in a log format is completely structured, an isolated event often does not provide enough information to be interpreted correctly. Missing information may include details about the log producer, the source and target of an activity, or the exact time when an event was observed.

1.3 Motivation

When we look at the variety of log formats used by event sources, there is no standard for logging and no agreement on what information is needed to make a log useful for security analysis. SIEM systems, such as Splunk [2], ArcSight [3], Prelude [4], and RSA Envision [5], do not widely use standardized formats or standards focused on alert representation, i.e., events solely focused on system and network security. Considerable effort has been made to address some of these issues [6]-[8]; for example, ArcSight has created its own format, but it is not widely used. We studied SIEM systems and came up with a proposal based on the now-discontinued CEE standardization efforts of MITRE and its partners [9]. We aim to better define the needs of certain sectors of the IT industry and better define the logs they produce so that these logs can be written in a single, all-encompassing format. This format must be highly regulated and meet the needs of network security systems and operational monitoring systems.

1.4 Contribution

The main goal of our research is to provide a solution for a SIEM/IDS system being developed at the Hasso Plattner Institute in Germany. This system is a real-time event analysis and monitoring (REAM) system [10] that provides deeper monitoring and analytics for systems across the network. In order to function, the REAM system needs to read arbitrary log files and generate unified events that incorporate all the information gathered from these logs. At the same time, it adds more information where relevant and possible. In this paper, we describe a workable solution for unifying event formats. This solution draws on human knowledge and information embedded within the logs to produce a normalized event representation model. The solution can be used to correlate events produced by different sources within a network. We describe how named-group regular expressions (NGREs) and a knowledge base can be used to create and populate events with accurate information.

2 Event Normalization

In this paper, the format, syntax, and method of persistence of an event are encompassed within a unified event representation model (UERM). Often, the mapping of elements from one UERM to another is imperfect because there is no logical connection between some elements within the UERMs, or an element in one UERM does not exist and is not represented in the other UERM. Therefore, any modern standardized log format has to be highly flexible and have a unified structure. The CEE format provides this flexibility by allowing templates (profiles) to be laid on top of a core profile [11] that comprises elements deemed absolutely necessary for any log entry. These elements include information about the event producer, the time the event occurred, and other information. With this layout, software vendors can design their own profiles and lay them on top of the core CEE profile. UERMs such as CEE, CEF, IODEF, IDMEF, and CybOX have problems that prevent their use in a REAM system. These problems include a lack of openness in the UERM (i.e., new fields can only be added by the vendor) and missing essential features or elements (e.g., lack of support for common security fields, such as CVE) [12]. Such fields are needed to encapsulate other UERMs, such as Syslog, Apache log, and Snort messages. Because CEE is no longer being developed, another UERM, called Object Log Format (OLF) [13], is being used as the UERM in REAM systems.

In this paper, a security event is the encapsulation of any event related to the confidentiality, integrity, and availability of the system it represents.

2.1 Extracting Event Information from Logs

The steps involved in automatically extracting relevant information from a log line written by people are complex. Issues arise from the unstructured nature of content written and used by people. Some parts of a log entry are loosely structured. Parts such as “client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /” can provide important information that, if understood by a machine, could lead to automatic discovery of patterns that are useful in security analysis. One solution implemented in some SIEM systems is collective analysis of all logs of the same type [3], [5]. However, this solution has deficiencies, including omission of important cross-reference information from the analyzed dataset. If a log produced by one application is connected to a log produced by another application, then this connection will be missed. Another issue with creating streams of logs from the same application is that the server has to handle many different connections from a single host and has to be given specific information about each one.

The ultimate goal is to analyze a single, unified, structured set of event information as opposed to multiple sets of information, each of which has its own format. To read logs from the source, understand these logs, and convert them into another format, regular expressions with named capturing groups are necessary. NGREs have long been available in languages such as Python and Perl and, with Java 1.7, have been added to the standard Java API. By using NGREs, it is possible to extract every piece of information from a log line and assign each piece to its mapped attribute within the unified log format. Fig. 1 shows a complex log line produced by Apache.

▲Figure 1. Apache access log.

Apache access logs are a good example of the easy extraction of information from logs because Apache presents information in a structured manner. With Apache’s Combined Log Format specification, it is possible to write a single NGRE that can match the output of one of the most widely used web servers in the world. Fig. 2 shows an NGRE that can handle almost all access logs generated by the Apache web server. Although it may look complicated, this expression only needs to be written once and thereafter can be used by any number of users, none of whom need to know what an NGRE is or how it operates.

After applying the NGRE from Fig. 2 to the log in Fig. 1, we end up with a list of key-value pairs (Table 1).

In most modern programming languages, it is easy to convert such a list into a structured object.
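The extraction and conversion steps can be sketched in Python as follows. This is a simplified stand-in and not the NGRE of Fig. 2: the pattern, the group names, and the `AccessEvent` class are assumptions chosen for illustration, and the sample line is the Combined Log Format example from the Apache documentation. Python writes named groups as `(?P<name>...)`, whereas Java uses `(?<name>...)`.

```python
import re
from dataclasses import dataclass

# Assumed, simplified NGRE for the Apache Combined Log Format.
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


@dataclass
class AccessEvent:
    """Hypothetical structured object built from the key-value pairs."""
    host: str
    ident: str
    user: str
    time: str
    method: str
    path: str
    protocol: str
    status: int
    size: int
    referer: str
    agent: str


def parse(line):
    """Return an AccessEvent, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    pairs = match.groupdict()            # the key-value list
    pairs["status"] = int(pairs["status"])
    pairs["size"] = 0 if pairs["size"] == "-" else int(pairs["size"])
    return AccessEvent(**pairs)


SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

event = parse(SAMPLE)
print(event)
```

Because the group names double as field names, the mapping from match to object is a single call, and a line that does not fit the pattern is reported as `None` instead of producing a partly filled event.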

▲Figure 2. NGRE for Apache access logs.

▼Table 1. Extracted information (key/value)

2.2 Creating New Events from a Normalized Event

The results of NGRE matching depend on the event instance being matched. The more information inside the event, the more details can be extracted and mapped to their representative values in the new unified log format. In cases where a considerable amount of information is missing from the event instance, human knowledge can greatly help to add the missing information. Therefore, a subsystem was developed to add human knowledge to a unified event. This subsystem (the knowledge base) uses a table to store this information alongside the NGREs used to match the event being processed. When a match is made, the information in the knowledge base is used to help build the instance of a unified event. Information extracted from an event using NGREs may be different for two events of the same type. These fields are referred to as dynamic fields. In addition to these fields, there are fields produced by a human and stored in the knowledge base alongside the NGRE. These fields are constant for two or more instances of the same event and are referred to as static fields. To create these fields, system developers have to analyze a representative instance of a given event and include as much information about the instance as possible. This information should be applicable to all instances of the event. Fields such as Time and Producer are not static because they differ from one instance to another.
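A minimal sketch of this combination of static and dynamic fields is given below. The in-memory list stands in for the knowledge base table, and the entry itself (an NGRE for a failed SSH password message together with the static fields `domain`, `action`, and `status`) is a hypothetical example, not an entry from the actual knowledge base.

```python
import re

# Hypothetical knowledge base: each entry pairs an NGRE with the
# static fields a developer wrote down for that event type.
KNOWLEDGE_BASE = [
    {
        "ngre": re.compile(
            r"^Failed password for (?P<user>\S+) "
            r"from (?P<src_ip>\S+) port (?P<src_port>\d+)"
        ),
        "static": {
            "domain": "Authentication",
            "action": "login",
            "status": "failure",
        },
    },
]


def build_event(message):
    """Merge static (human-written) and dynamic (extracted) fields."""
    for entry in KNOWLEDGE_BASE:
        match = entry["ngre"].match(message)
        if match:
            event = dict(entry["static"])     # same for every instance
            event.update(match.groupdict())   # differs per instance
            return event
    return None                               # unknown event type


event = build_event("Failed password for root from 192.0.2.10 port 52144 ssh2")
print(event)
```

In this sketch the dynamic fields are applied last, so an extracted value would override a static field of the same name; whether that precedence is desirable is a design decision the sketch does not settle.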

Different log lines may have the same fields but different values and ranges. To map their values, an interpretation step is necessary. Consider the Priority field. Many log formats specify a field for the event’s priority. The more important the producer deems an event, the higher the priority assigned to it. Problems arise when different producers use different formats with different scales or methods for calculating the priority. Some event formats might specify a scale of 1 to 10 whereas others specify a scale of 0 to 255. To map these values to the scale used by the UERM representing priority, a script is run on the log line. We implement these scripts in JavaScript. Short scripts are stored in the database along with the NGRE strings, and the correct script is run when the log format is recognized. The scripts are compiled upon first use, translated into byte code, and run on top of the Java virtual machine (version 8). This provides flexibility because the scripts can be updated or replaced without interrupting the core of the REAM system.
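The kind of interpretation step described above can be sketched as a linear rescaling. The sketch is written in Python rather than JavaScript, and the target scale of 0 to 100, the format names, and the `invert` option are assumptions for illustration only; the scale actually used by the UERM is not stated here. The `invert` option covers formats in which a lower number means a more important event, as is the case for Syslog severity, where 0 is the most severe level.

```python
def rescale(value, src_min, src_max, dst_min=0, dst_max=100, invert=False):
    """Linearly map value from [src_min, src_max] to [dst_min, dst_max]."""
    if src_max <= src_min:
        raise ValueError("source range must be non-empty")
    clamped = min(max(value, src_min), src_max)   # tolerate out-of-range input
    ratio = (clamped - src_min) / (src_max - src_min)
    if invert:                                     # low number = high priority
        ratio = 1.0 - ratio
    return round(dst_min + ratio * (dst_max - dst_min))


# Hypothetical per-format "scripts", keyed by the recognized log format.
PRIORITY_SCRIPTS = {
    "scale_1_to_10": lambda v: rescale(v, 1, 10),
    "scale_0_to_255": lambda v: rescale(v, 0, 255),
    "syslog_severity": lambda v: rescale(v, 0, 7, invert=True),
}


def normalize_priority(format_name, raw_value):
    """Run the script registered for the recognized format."""
    return PRIORITY_SCRIPTS[format_name](raw_value)


print(normalize_priority("scale_1_to_10", 10))
print(normalize_priority("scale_0_to_255", 255))
print(normalize_priority("syslog_severity", 0))
```

Keeping the mapping in a lookup table mirrors the idea of storing one short script per format, so a new scale can be supported by adding an entry without touching the matching code.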

2.2.1 Defining Event Tags

To help with broader analytical tasks performed on logs, CEE includes a number of tags that can be used by developers to tag an event with a broad, generic label. These tags operate in much the same way as browser bookmark tags or tags used for organizing emails. Logs can be tagged with fields such as Domain, for which one possible value would be Web. This tag can then be used to measure the amount of web-related traffic as a percentage of total network traffic. One simple way to assign these tags to arbitrary logs is to use the knowledge base. Because events are matched against knowledge base entries, preset tags can be applied to newly created OLF events.
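As a small, hedged example of how such a tag could be used, the following Python sketch computes the share of events per Domain tag. It counts normalized events rather than network traffic volume, and the `domain` key and the sample events are assumptions made for illustration.

```python
from collections import Counter


def domain_shares(events):
    """Return the percentage of events carrying each Domain tag."""
    counts = Counter(event.get("domain", "Unknown") for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: 100.0 * n / total for domain, n in counts.items()}


# Hypothetical normalized events; only the tag matters here.
events = [
    {"domain": "Web"},
    {"domain": "Web"},
    {"domain": "Authentication"},
    {},                       # an event that was never tagged
]

shares = domain_shares(events)
print(shares)
```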

2.2.2 Creating Common Log Entries

Creating a common log format requires a deep understanding and precise mappings of the supported log formats, which are normalized into a single format. One of the more challenging conversions is Syslog to OLF because Syslog has an inherently simple structure but its message (MSG element) has almost limitless complexity.

2.2.3 Event Format Recognition

The proposed OLF model comprises hundreds of fields, each mapping to a different aspect of some event. Where possible, duplicates are eliminated and attributes from different logs are mapped to a single field in the OLF. A REAM system uses the approach in [14] to efficiently find the type and variation of the log format currently being processed. Textual metrics and profiling techniques are used to transform incoming events into an index key, and proximity searching on this key is used to find a known relative of the event instance. When a match is found, an exact NGRE can be used to process the event.
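The details of the approach in [14] are not given in this paper, so the following Python sketch only illustrates the general idea under stated assumptions. It replaces the actual textual metrics with a crude character-class profile (runs of letters become `a`, runs of digits become `9`) and replaces the proximity search with `difflib.get_close_matches` from the Python standard library. The format names, the representative lines, and the cutoff of 0.6 are all hypothetical.

```python
import difflib
import re


def profile(line):
    """Reduce a log line to a shape key: letter runs -> 'a', digit runs -> '9'."""
    key = re.sub(r"[A-Za-z]+", "a", line)
    return re.sub(r"\d+", "9", key)


# Hypothetical index: shape key of one representative line -> format name.
KNOWN_FORMATS = {
    profile('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
            '"GET /apache_pb.gif HTTP/1.0" 200 2326'): "apache_common",
    profile("<34>Oct 11 22:14:15 mymachine su: 'su root' failed "
            "for lonvick on /dev/pts/8"): "syslog_rfc3164",
}


def recognize(line, cutoff=0.6):
    """Return the name of the closest known format, or None."""
    candidates = difflib.get_close_matches(
        profile(line), list(KNOWN_FORMATS), n=1, cutoff=cutoff
    )
    return KNOWN_FORMATS[candidates[0]] if candidates else None


unseen = '192.0.2.7 - - [19/Jul/2014:00:24:01 +0200] "POST /login HTTP/1.1" 302 512'
print(recognize(unseen))
print(recognize("completely unrelated text"))
```

Once a candidate format has been found this way, only the NGREs registered for that format need to be tried, instead of every expression in the knowledge base.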

2.2.4 Architecture and Design

The application uses a multithreaded approach: one thread reads the contents of a log file and pushes the logs to a queue, and multiple worker threads then retrieve logs from the queue. Each worker thread attempts to match a regular expression against the log it has retrieved from the queue using the approach in [14]. When a match is found, information within the log is extracted and inserted into an OLF event, which is subsequently persisted to a database. Fig. 3 shows the steps taken to normalize the events.
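The reader and worker arrangement described above can be sketched with the Python standard library as follows. This is an illustration of the structure only: the single pattern, the list that stands in for the database, and the worker count are assumptions, and the format recognition step of [14] is omitted. In CPython, threads show the shape of the design rather than providing CPU parallelism for regular expression matching.

```python
import queue
import re
import threading

# Hypothetical pattern list; a real system would select NGREs per format.
PATTERNS = [re.compile(r"^(?P<user>\w+) logged (?P<action>in|out)$")]
STOP = None  # sentinel telling a worker that no more logs will arrive


def reader(lines, log_queue, n_workers):
    """Push every log line to the queue, then one sentinel per worker."""
    for line in lines:
        log_queue.put(line)
    for _ in range(n_workers):
        log_queue.put(STOP)


def worker(log_queue, store, lock):
    """Take logs from the queue, match them, and persist the result."""
    while True:
        line = log_queue.get()
        if line is STOP:
            break
        for pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                with lock:                       # the list stands in for a database
                    store.append(match.groupdict())
                break


def normalize(lines, n_workers=4):
    log_queue = queue.Queue()
    store, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(log_queue, store, lock))
        for _ in range(n_workers)
    ]
    for thread in workers:
        thread.start()
    reader_thread = threading.Thread(
        target=reader, args=(lines, log_queue, n_workers)
    )
    reader_thread.start()
    reader_thread.join()
    for thread in workers:
        thread.join()
    return store


events = normalize(["alice logged in", "bob logged out", "unparseable noise"])
print(sorted(event["user"] for event in events))
```

The order in which events reach the store depends on thread scheduling, which is why the usage line sorts the result before printing it.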

▲Figure 3. Event detection.

3 Related Works

Event normalization, especially incident management, is an ongoing area of research. Much effort has been made in standardizing formats for making event data more persistent, especially for vendors in the IDS domain. The incident object description and exchange format working group (IODEF WG), extended incident handling working group (INCH WG), and MITRE Corporation have provided standards for incident serialization. These standards include the incident object description and exchange format (IODEF) [15] and the intrusion detection message exchange format (IDMEF) [8]. The problem with these incident formats is that they are limited to covering alerts only. However, if security breaches have to be analyzed further by inspecting postmortem activities, normally occurring, non-critical activities, such as file access, become more interesting.

Looking at log management in general can help overcome the limitations of incident formats. Some approaches to log normalization originated from software products and others originated from standardization institutions. An example of the former is ArcSight’s common event format (CEF), proposed in a white paper [6] by Hewlett-Packard. The ArcSight format introduces a flat hierarchy of properties and comprises a set of typical event properties. Two examples limited to web servers are the Common Log Format and Combined Log Format [6]. Both of these were introduced with the Apache web server. Two more generic approaches for event formats are given by MITRE Corporation, which proposed CEE [9] and Cyber Observable eXpression (CybOX) [16]. CEE was a promising format because it provided a basic set of common event properties that can be further extended with more event properties as needed. CybOX, on the other hand, is a very complex format that covers most activities in one big format. As a result, CybOX is too bloated to be efficient in a production environment.

In addition to research done by standardization bodies, other researchers have investigated ways of efficiently normalizing events. Avourdiadis and Blyth [17], [18] propose integrating existing XML-based formats, such as IODEF, IDMEF, and Format for INcident information exchange (FINE) [19], into one database. In this database, a core section holds data common in incident messages, and an extensible section can hold additional (uncommon) data in various formats. The mapping of XML elements to database fields is described by an XML document. The restriction to incident messages in XML formats is a significant limitation because formats such as Syslog are not fully structured. This makes it hard to map all available information. The authors do not describe how to map such unstructured data.

4 Conclusion

In this paper, we have discussed possible improvements to IDS and SIEM systems and the need for better event representation. Two main avenues for regenerating more complete, unified events from their initial representations are: 1) extracting as much information as possible from the event representation being interpreted, and 2) adding human knowledge about the particular event to the unified representation. Most systems attempt to tokenize existing representations of events in order to reproduce them in analytically friendlier formats. However, this is excessively complicated and resource-intensive. Adding support for new log formats often requires considerable effort in terms of development and deployment into existing systems (often via heavy version upgrades). Named-group regular expressions can make the normalization of logs considerably easier. Such expressions can be written to include some logic, e.g., to tolerate a field that has been omitted from a given instance of a log, and they also allow the application to extract key-value pairs from the log line. The keys can be the fields of the UERM used by the system (or can be mapped to them). For the REAM system knowledge base, a UERM was built on the now-defunct CEE standard. It is not easy to unify logs, and the task has been attempted by many SIEM system vendors in the past with varying degrees of success. The proposed approach for unifying events decreases the time and effort needed to expand event normalization in a SIEM system and creates more complete, intelligent events for analytical purposes.

References

[1] Guide to computer security log management, Technology Administration U.S. Department of Commerce, Sept. 2006.

[2] Splunk Inc. Splunk enterprise [Online]. Available: http://www.splunk.com/

[3] Hewlett-Packard. ArcSight logger [Online]. Available: http://www8.hp.com/us/en/software-solutions/arcsight-logger-log-management/

[4] CS. Prelude IDS [Online]. Available: http://www.prelude-ids.com/en/

[5] EMC2. RSA envision [Online]. Available: http://emc.com/security/rsa-envision.htm

[6] H.-P. ArcSight, “Common Event Format,” tech. rep., July 2009. Rev. 15.

[7] The Incident Object Description Exchange Format, RFC 5070, Dec. 2007.

[8] The Intrusion Detection Message Exchange Format (IDMEF), RFC 4765, Mar. 2007.

[9] A. Chuvakin, R. Marty, W. Heinbockel, J. Judge, and R. McQuaid, “Common event expression,” white paper, CEE Board, June 2008.

[10] Formerly (SAL) [Online]. Available: https://hpi.de/meinel/security-tech/network-security/security-analytics.html

[11] T. M. Corporation. CEE core profile [Online]. Available: http://cee.mitre.org/language/1.0-beta1/core-profile.html

[12] The MITRE Corporation. Common vulnerabilities and exposures (CVE) [Online]. Available: https://cve.mitre.org/

[13] S. Andrey, J. David, A. Amir, G. Marian, C. Feng, and M. Christoph, “Hierarchical object log format for normalisation of security events,” in Proceedings of the 9th International Conference on Information Assurance and Security (IAS2013), Kuala Lumpur, Malaysia, 2013, pp. 25-30.

[14] Azodi, “A new approach to building a multi-tier direct access knowledge base for IDS/SIEM systems,” in Proceedings of the 11th IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC2013), East Syracuse, NY, USA, 2013.

[15] The Incident Object Description Exchange Format, RFC 5070 (Proposed Standard), Dec. 2007.

[16] The CybOX Language Specification, the MITRE Corp., Apr. 2012.

[17] N. Avourdiadis and A. Blyth, “Data unification and data fusion of intrusion detection logs in a network centric environment,” in Proceedings of the 4th European Conference on Information Warfare and Security, Glamorgan, United Kingdom, 2005, pp. 9-20.

[18] N. Avourdiadis and A. Blyth, “Normalising events into incidents using unified intrusion detection-related data,” in Proceedings of the First European Conference on Computer Network Defence School of Computing, University of Glamorgan, Wales, UK, 2006, pp. 283-296. doi: 10.1007/1-84628-352-3_28.

[19] Y. Demchenko, H. Ohno, R. Danyliw, and G. M. Keeni, “Requirements for format for INcident information exchange (FINE),” Network Working Group, Dec. 2005.

Biographies

Amir Azodi (amir.azodi@hpi.uni-potsdam.de) received his BSc degree in communication networks from Oxford Brookes University. He received his MSc degree in information security from University College London. He is currently a PhD student at the Department of Internet Technologies, Hasso Plattner Institute, Germany. His research interests include event normalization, intrusion detection, attack path detection, and visualization.

David Jaeger (david.jaeger@hpi.uni-potsdam.de) is a PhD student in the IT Security Engineering Team, Hasso Plattner Institute, Germany. From 2006 to 2009, he studied IT systems engineering at the Hasso Plattner Institute. He received his BSc degree in 2009 and his MSc degree in 2012. His research interests include intrusion detection, especially attack monitoring and analytics, as well as normalization of security-related information.

Feng Cheng (feng.cheng@hpi.uni-potsdam.de) is a senior researcher heading the IT Security Engineering Team at Hasso Plattner Institute in Germany. His research interests include network security, firewalls, IDS/IPS, security analytics, attack modeling and penetration testing, SOA, and cloud security. At the Hasso Plattner Institute, he is involved in R&D and teaching activities revolving around new IT security technologies. He has been the principal investigator and project manager for many research projects on IT security, including the project “Physical Separation and its Lock-Keeper Implementation,” which was commercialized by Siemens Switzerland (now with Atos Origin) in 2005. He has published more than 30 papers in international conference proceedings and journals. He has been chair, co-chair, coordinator, program committee member, and reviewer for many international workshops and conferences. He received his BEng degree from Beijing University of Aeronautics and Astronautics; he received his MEng degree from Beijing University of Technology; and he received his PhD degree from the University of Potsdam, Germany.

Christoph Meinel (christoph.meinel@hpi.uni-potsdam.de) is scientific director and CEO of the Hasso Plattner Institute, Germany. In 2006, Professor Meinel and Hasso Plattner hosted the 1st National IT Summit of German Chancellor Dr. Angela Merkel at HPI in Potsdam. Dr. Meinel is a member of Acatech (the German National Academy of Science and Engineering) and numerous scientific committees and supervisory boards. Dr. Meinel is a full professor (C4) of computer science and is department chair of internet technologies and systems at the Hasso Plattner Institute. He teaches courses in the Bachelor’s degree and Master’s degree programs in IT systems engineering and at the HPI School of Design Thinking. He has authored or coauthored nine books and four anthologies and has edited various conference proceedings. He studied mathematics and computer science at Humboldt University of Berlin from 1974 to 1979.

Manuscript received: 2014-04-14

DOI: 10.3939/j.issn.1673-5188.2014.03.008

http://www.cnki.net/kcms/detail/34.1294.TN.20140819.0832.001.html, published online August 19, 2014

by the central repository. The way events are logged by IT systems is problematic for developers of intrusion-detection systems (specifically, host-based systems), developers of security-information systems, and developers of event-management systems. These problems preclude the development of more accurate, intrusive security solutions that obtain results from data included in the logs being processed. We propose a new method for dynamically normalizing events into a unified super-event that is loosely based on the Common Event Expression standard developed by MITRE Corporation. We explain how our solution can normalize seemingly unrelated events into a single, unified format.