Research on Traffic Identification Technologies for Peer-to-Peer Networks

2007-06-19 13:56 Zhou Shijie, Qiu Zhiguang, Wu Chunjiang
ZTE Communications, 2007, Issue 4

Zhou Shijie Qiu Zhiguang Wu Chunjiang

(School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China)

Abstract: Peer-to-Peer (P2P) network traffic identification technology includes Transport Layer Identification (TLI) and Deep Packet Inspection (DPI) methods. By analyzing transport-layer packets and the traffic characteristics of P2P systems, TLI can identify whether or not a network data flow belongs to a P2P system. The DPI method adopts protocol analysis and reconstruction techniques. It extracts data from the P2P application layer and analyzes the characteristics of the payload to judge whether the network traffic belongs to P2P applications. Owing to its accuracy, robustness and classification ability, DPI is the main method used to identify P2P traffic. By combining the advantages of TLI and DPI, a precise and efficient technology for P2P network traffic identification can be designed.

In recent years, Peer-to-Peer (P2P) networks have grown dramatically in scale, application and traffic. According to one analysis, Skype, a P2P-based voice communication application, has up to 9 million concurrent online users in China; the registered users of P2P streaming video networks such as PPLive and PPStream have exceeded 100 million, of which one to five million are online simultaneously. P2P applications have extended from file sharing in the past to voice and video communications today. As for traffic, an analysis report on actual Internet traffic in China shows that P2P network traffic accounts for 60% of total Internet traffic.

Consequently, international network equipment manufacturers and Internet service providers have successively introduced P2P traffic identification and monitoring products. P2P traffic identification devices include web cache equipment, application-layer traffic management equipment, flow statistics routers and intelligent firewalls. Cisco NetFlow technology [1], Allot's fault recovery traffic management solution [2], Cachelogic's P2P management solution [3], and the NetSpective series of Verso Technologies [4] are examples of such products. In these products, each company uses its own Deep Packet Inspection (DPI) technology; the technologies are basically similar but differ in performance and identification precision.

In China, little research has been conducted on P2P network traffic identification. There are few quality academic papers, and also few efficient P2P multimedia content identification and filtering products. Although some Chinese network equipment manufacturers have launched P2P traffic monitoring products, for example, the network management software CAP of CAPTECH Company [5], these products have problems in performance and overhead because all of them adopt DPI technology.

Research on effective, accurate and real-time identification and filtering of P2P traffic (especially multimedia content) enables better use of the current Internet infrastructure and P2P technologies, as well as a reasonable deployment of P2P applications. In addition, effective, accurate and real-time identification and filtering of P2P traffic is very helpful in preventing the propagation of illegal materials in P2P networks, thus ensuring a healthy environment for the Internet and helping to build a harmonious network society.

1 Difficulties in P2P Network Traffic Identification

The P2P network is a kind of distributed network where participants share some of their hardware resources (including processing and storage capabilities). These shared resources, together with the services and contents provided by the network, can be directly accessed by other peer nodes without going through an intermediary entity. The participants in a P2P network are both resource providers (i.e., the role of a server) and resource users (i.e., the role of a client).

In addition to its typical application, file sharing (e.g., Napster), the P2P network is applied in P2P-based communication network setup, P2P computing and other kinds of resource sharing. The basic idea of the P2P network, which also distinguishes it from the Client/Server (C/S) architecture, is the dual role of each node in the network: it is a server that provides resources and a client that accesses resources. In general, the rights and obligations of a node in a P2P network match each other with respect to communications, service and resource consumption.

P2P networks can be divided into two models: pure P2P and hybrid P2P. In the pure P2P model, there are no centralized entities or servers in the network, and the removal of an entity does not have much impact on the services of the network. The hybrid P2P model, however, requires a centralized entity to offer some specific network services such as meta-information storage, indexing or routing, and security inspection.

The rapid development of P2P applications enriches the contents of the Internet, but the dramatic increase in traffic and the unlimited occupation of bandwidth not only challenge the Internet infrastructure, but also bring many service deployment problems for Internet Service Providers (ISPs) and Application Service Providers (ASPs). In addition, P2P networks have quickly become hotbeds for propagating malicious code, pornographic materials or other unhealthy information, and pirated resources.

Quick identification and classification of P2P network traffic can provide technical support for ISPs and ASPs to improve their Quality of Service (QoS), and can also ensure effective monitoring of network contents (e.g., malicious code identification and virus defense). However, the intrinsic features of P2P networks listed below make accurate, efficient and real-time identification of P2P traffic more difficult.

(1) Uncertain

As the applications of P2P networks diversify, both their traffic characteristics and their behavior become difficult to determine. In addition, the dynamic nature of P2P network nodes makes the traffic in a P2P network more uncertain; therefore, identifying the traffic becomes more difficult.

(2) Massive

In addition to having diversified applications, a P2P network is very large (for example, the number of concurrent online nodes of BitTorrent, a file sharing P2P system, may reach a maximum of 1 million). As a result, its traffic is often massive. Massive traffic in P2P networks poses an obstacle to accurate and timely identification of the traffic.

(3) Encrypted

Operating in the application layer, P2P networks often try to evade content monitoring by encrypting their payload. This makes the common identification algorithms inapplicable to P2P networks. Therefore, new traffic identification methods or technologies have to be developed to ensure the accuracy and reliability of P2P traffic identification.

In terms of technologies, the current P2P network traffic identification methods fall into two categories: Transport Layer Identification (TLI) and DPI.

2 Transport Layer Identification

In a P2P system, each node functions both as a server and as a client. This dual role gives P2P applications transport-layer traffic characteristics that differ from those of other network applications (e.g., Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), Domain Name System (DNS), and e-mail).

The basic idea of TLI is to identify whether or not a network stream is a P2P stream by analyzing the packets (including Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) packets) in the transport layer and comparing them with the traffic characteristics of P2P systems. The methods in this category include TCP/UDP port number identification, network diameter analysis, node role analysis, protocol pair analysis and IP-port pair analysis.

The TCP/UDP port number identification method was developed based on the fixed service ports that characterize first-generation P2P systems. Sen and Wang [5] first addressed the P2P traffic identification problem and used the port number identification method to analyze the traffic characteristics of three typical P2P systems: FastTrack, Gnutella and DirectConnect. The service ports commonly used by current P2P systems are listed in Table 1. Nowadays, however, many P2P systems use arbitrary ports to evade traffic auditing and filtering; therefore, a large volume of traffic may be missed with this method.
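The port-based method can be sketched roughly as follows. This is a minimal illustration, not the procedure of the cited study: the port numbers are widely documented default ports of several P2P systems and are not taken from Table 1, and the names `KNOWN_P2P_PORTS` and `classify_by_port` are hypothetical.

```python
# Illustrative default ports of some P2P systems (assumption: these are the
# commonly documented defaults, not the contents of the article's Table 1).
KNOWN_P2P_PORTS = {
    "FastTrack": {1214},
    "Gnutella": {6346, 6347},
    "DirectConnect": {411, 412},
    "eDonkey": {4661, 4662},
    "BitTorrent": set(range(6881, 6890)),  # 6881-6889
}


def classify_by_port(src_port, dst_port):
    """Return the P2P system whose default port matches either end, else None."""
    for name, ports in KNOWN_P2P_PORTS.items():
        if src_port in ports or dst_port in ports:
            return name
    return None


if __name__ == "__main__":
    print(classify_by_port(51000, 6881))
    print(classify_by_port(51000, 80))
```

A flow on an arbitrary port returns `None` here, which is exactly the weakness described above: traffic that avoids the default ports is missed.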

The theoretical basis for the network diameter analysis method is that the diameter of a P2P logical network is often quite large. In a P2P system, the connections between nodes are logical rather than physical, so the P2P network is a logical network. Constantinou and Mavrommatis [6] present a logical connection topology of the P2P system, obtained by recording the connections of each node with other nodes, and compute the network diameter. Their research shows that the logical network of a P2P system has a larger diameter than the logical networks of other applications. Therefore, if the diameter of a network exceeds a certain threshold, the nodes in this network should be regarded as P2P nodes, and the traffic of the network should be counted as P2P traffic. With this method, however, the connections of the entire network have to be recorded in order to compute the network diameter; therefore, considerable storage and computation overheads are involved. In addition, the method does not support real-time identification and filtering of P2P traffic.
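As a rough sketch of the diameter test, the recorded connections can be treated as an undirected graph and the longest shortest path computed by breadth-first search. The function names and the threshold value are assumptions made for illustration; the source does not specify how the threshold is chosen.

```python
from collections import defaultdict, deque


def build_graph(connections):
    """Build an undirected adjacency map from (node_a, node_b) records."""
    graph = defaultdict(set)
    for a, b in connections:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def eccentricity(graph, start):
    """Longest shortest-path distance (in hops) from start, found by BFS."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def diameter(connections):
    """Largest eccentricity over all nodes (per connected component)."""
    graph = build_graph(connections)
    return max((eccentricity(graph, n) for n in list(graph)), default=0)


def looks_like_p2p(connections, threshold):
    """Flag the logical network as P2P when its diameter exceeds threshold."""
    return diameter(connections) > threshold


if __name__ == "__main__":
    chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
    star = [("hub", "w"), ("hub", "x"), ("hub", "y"), ("hub", "z")]
    print(diameter(chain), diameter(star))
```

Running one BFS per node over the full connection record is what makes the storage and computation cost grow quickly with network size.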

▼Table 1. Service ports commonly used by current P2P systems

The idea of the node role analysis method comes from the unique dual role of nodes in a P2P system. If some nodes in a logical network are found to play such dual roles, the network is then a P2P network. In the approach of Constantinou and Mavrommatis [6], the number of nodes that act as both servers and clients is also recorded and computed. Once this number exceeds a certain threshold, the network these nodes belong to can be determined to be a P2P network, and its traffic should be counted as P2P traffic. Like network diameter analysis, the node role analysis method requires the connections of the entire network to be recorded. Thus, it faces the same problems: large storage and computation overheads, and an inability to identify and filter P2P traffic in real time.
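The dual-role count can be sketched as a set intersection. The sketch assumes each recorded connection is an `(initiator, responder)` pair, so that initiating a connection stands for the client role and accepting one stands for the server role; the function names and the threshold are hypothetical.

```python
def dual_role_nodes(connections):
    """Return nodes that appear both as initiator (client) and responder (server)."""
    initiators = {src for src, _ in connections}
    responders = {dst for _, dst in connections}
    return initiators & responders


def is_p2p_network(connections, threshold):
    """Flag the network as P2P when the dual-role node count exceeds threshold."""
    return len(dual_role_nodes(connections)) > threshold


if __name__ == "__main__":
    peers = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    print(sorted(dual_role_nodes(peers)))
```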

A P2P system may use the TCP and UDP protocols simultaneously, and the protocol pair analysis method makes use of exactly this behavior. Related analyses show that a P2P system often uses UDP to send control information such as commands, and uses TCP to transmit data. In common non-P2P applications, however, cases in which both UDP and TCP are used simultaneously are rare. Therefore, P2P traffic can be identified by analyzing the protocols used. In the protocol pair analysis method discussed at the 4th ACM SIGCOMM Conference on Internet Measurement [7], the traffic between a pair of IP addresses (source and destination) is regarded as P2P traffic if both TCP and UDP have been used between the address pair within a specific time t. Otherwise, the traffic is not P2P traffic.

However, other applications such as DNS may also use TCP and UDP at the same time; therefore, this method often produces inaccurate traffic statistics.
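A minimal sketch of the protocol pair heuristic is given below. It assumes flow records of the form `(timestamp, src_ip, dst_ip, protocol, dst_port)`. The `exclude_ports` parameter is an addition for illustration only: it shows one possible way to reduce false positives such as DNS (port 53) and is not described in the source.

```python
from collections import defaultdict


def tcp_udp_pairs(flows, window, exclude_ports=frozenset({53})):
    """Return IP pairs that used both TCP and UDP within `window` seconds.

    flows: iterable of (timestamp, src_ip, dst_ip, protocol, dst_port).
    exclude_ports: hypothetical allowlist of non-P2P ports to ignore.
    """
    seen = defaultdict(lambda: {"TCP": [], "UDP": []})
    for ts, src, dst, proto, dport in flows:
        if dport in exclude_ports or proto not in ("TCP", "UDP"):
            continue
        # The pair is unordered: traffic in either direction counts.
        seen[frozenset((src, dst))][proto].append(ts)

    flagged = set()
    for pair, times in seen.items():
        if any(abs(t - u) <= window
               for t in times["TCP"] for u in times["UDP"]):
            flagged.add(pair)
    return flagged


if __name__ == "__main__":
    flows = [
        (0.0, "10.0.0.1", "10.0.0.2", "UDP", 4672),
        (3.0, "10.0.0.1", "10.0.0.2", "TCP", 4662),
        (1.0, "10.0.0.3", "8.8.8.8", "UDP", 53),
        (2.0, "10.0.0.3", "8.8.8.8", "TCP", 53),
    ]
    print(tcp_udp_pairs(flows, window=10))
```

Passing `exclude_ports=frozenset()` reproduces the unfiltered heuristic, in which the DNS pair above would also be flagged.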

The IP-port pair analysis method [7] also takes advantage of the dual role of nodes in a P2P system. In a P2P system, in order to accept connection requests from other nodes, each node has to advertise its own IP address and a service port number (called the destination IP-port pair, denoted as {destination IP, destination port}). On the other hand, to set up connections with other nodes, each node uses its own IP address and a random port number (called the source IP-port pair, denoted as {source IP, source port}). Therefore, each node that connects to an advertised destination does so from a random source port. For any destination node that has advertised its IP-port pair, the number of source IP addresses and the number of source ports involved should therefore be the same. In other applications (e.g., HTTP), by contrast, several connections may be required to transmit data, so a node with a single source IP may use different source ports to set up several connections with the Web server. As a result, the number of source IP addresses is often different from the number of source ports. In conclusion, if the number of source IP addresses is the same as the number of source ports within a unit time t, the traffic can be regarded as P2P traffic. The IP-port pair analysis method performs excellently in identifying P2P traffic, but it still cannot identify and filter the traffic in real time.
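The counting rule can be sketched as follows, assuming connection records of the form `(src_ip, src_port, dst_ip, dst_port)` collected within one time window t. The `min_sources` parameter is an assumption added here so that a destination seen only once or twice is not flagged trivially; the source does not state such a minimum.

```python
from collections import defaultdict


def p2p_endpoints(connections, min_sources=3):
    """Return destination (ip, port) pairs whose distinct source-IP count
    equals their distinct source-port count.

    min_sources is a hypothetical lower bound on the number of source IPs.
    """
    src_ips = defaultdict(set)
    src_ports = defaultdict(set)
    for src_ip, src_port, dst_ip, dst_port in connections:
        key = (dst_ip, dst_port)
        src_ips[key].add(src_ip)
        src_ports[key].add(src_port)
    return {
        key for key in src_ips
        if len(src_ips[key]) >= min_sources
        and len(src_ips[key]) == len(src_ports[key])
    }


if __name__ == "__main__":
    conns = [
        # Three peers, one connection each, to an advertised peer endpoint.
        ("10.0.0.1", 50001, "10.0.0.9", 6881),
        ("10.0.0.2", 50002, "10.0.0.9", 6881),
        ("10.0.0.3", 50003, "10.0.0.9", 6881),
        # Three web clients, several parallel connections each.
        ("10.0.1.1", 40001, "10.0.0.80", 80),
        ("10.0.1.1", 40002, "10.0.0.80", 80),
        ("10.0.1.2", 40003, "10.0.0.80", 80),
        ("10.0.1.2", 40004, "10.0.0.80", 80),
        ("10.0.1.3", 40005, "10.0.0.80", 80),
    ]
    print(p2p_endpoints(conns))
```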

In addition to the above-mentioned methods, there are other TLI methods. Horng Mongfong et al. [8] propose a BitTorrent traffic identification method based on the following two observations:

(1) Many nodes send a lot of data to a destination node, and handshake packets occur at the destination node.

(2) Immediately after a node broadcasts many UDP packets, it sends many handshake packets.

Reference [9] introduces a P2P traffic identification approach based on TCP stream characteristics such as the connection error rate of the P2P system. Exploiting the "relay" feature of Skype, Suh et al. [10] present a method of identifying P2P network streams with the following parameters: start time difference, end time difference, bit rate, and cross correlation between two bursts. According to the experimental results in that reference, the metrics used for characterizing and detecting Skype-relayed traffic are as follows: the start/end time difference is less than 5 seconds, the input bit rate and the output bit rate of a stream are almost equal, and the cross correlation between two bursts of streams is no less than 0.37 seconds.
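The three reported criteria can be combined into a simple check, sketched below under several assumptions. The metrics are taken as precomputed inputs, the figure 0.37 is applied directly to whatever cross-correlation value the caller supplies, and "almost equal" bit rates are interpreted through a hypothetical relative tolerance of 10%. The names `StreamPair` and `is_skype_relay` are illustrative and do not come from the cited work.

```python
from dataclasses import dataclass


@dataclass
class StreamPair:
    """Precomputed metrics for an incoming/outgoing stream pair at one host."""
    start_diff_s: float   # difference between the two start times, seconds
    end_diff_s: float     # difference between the two end times, seconds
    in_bitrate: float     # input bit rate
    out_bitrate: float    # output bit rate
    burst_xcorr: float    # cross correlation between the two bursts


def is_skype_relay(pair, max_time_diff_s=5.0, min_xcorr=0.37,
                   rate_tolerance=0.1):
    """Apply the three criteria; rate_tolerance is an assumed parameter."""
    if pair.start_diff_s >= max_time_diff_s:
        return False
    if pair.end_diff_s >= max_time_diff_s:
        return False
    larger = max(pair.in_bitrate, pair.out_bitrate)
    if larger <= 0:
        return False
    rates_match = abs(pair.in_bitrate - pair.out_bitrate) <= rate_tolerance * larger
    return rates_match and pair.burst_xcorr >= min_xcorr


if __name__ == "__main__":
    print(is_skype_relay(StreamPair(1.2, 0.8, 64000, 63000, 0.5)))
```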

3 Deep Packet Inspection

DPI technology, a traffic identification technology based on application layer data, adopts protocol analysis and reconstruction techniques. It extracts data from the P2P application layer and analyzes the characteristics of the payload to judge whether the network traffic belongs to P2P applications. This technology first gathers the characteristics of a specific P2P protocol and its payload to form a characteristic library. Then it uses a pattern matching algorithm to inspect, in real time, the network stream passing through the inspection point. If the stream includes any characteristic string in the library, that is, if the characteristics of the stream match those in the library, the stream is regarded as P2P data.
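The characteristic library and matching step can be sketched as a small signature table with substring matching. The two entries below are well-known handshake strings (the BitTorrent handshake prefix and the Gnutella connection strings); they are chosen for illustration and are not the signature sets used in the cited products or studies. Production systems typically use multi-pattern matching algorithms rather than the naive scan shown here.

```python
# Illustrative characteristic library: application name -> payload signatures.
SIGNATURES = {
    "BitTorrent": [b"\x13BitTorrent protocol"],
    "Gnutella": [b"GNUTELLA CONNECT/", b"GNUTELLA/0.6"],
}


def classify_payload(payload):
    """Return the application whose signature occurs in payload, else None."""
    for app, patterns in SIGNATURES.items():
        if any(pattern in payload for pattern in patterns):
            return app
    return None


if __name__ == "__main__":
    handshake = b"\x13BitTorrent protocol" + bytes(8)
    print(classify_payload(handshake))
    print(classify_payload(b"GET / HTTP/1.1\r\n"))
```

Because the result names the matched application, this kind of matching also provides classification, whereas an encrypted payload would simply return `None`.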

▼Table 2. Comparison of TLI and DPI methods

As for applications of DPI technology, Sen et al. [11] analyzed the protocol characteristics of Gnutella, eDonkey, DirectConnect, BitTorrent and Kazaa, and compared application layer data against these characteristics to determine whether the data belongs to P2P traffic. In addition, Wang Rui et al. [12] used the application layer data analysis method to identify multimedia traffic.

With regard to the combination of TLI and DPI, Madhukar et al. [13] compared three P2P traffic classification methods: port-based classification, application-layer signatures and transport-layer analysis. Moreover, Ohzahata et al. [14] introduced the decoy node and analyzed the traffic of Winny, the most popular P2P application in Japan, with application layer signature matching analysis.

4 Analysis on the Advantages and Disadvantages of TLI and DPI

The advantages of TLI include scalability, good performance and the capability of identifying encrypted data streams.

(1) Scalability: As the technology uses the common traffic characteristics of all P2P applications, it can identify not only the traffic of existing P2P applications, but also the traffic of any new P2P application that has these common characteristics.

(2) Good performance: This technology does not need to analyze and reconstruct the protocol, or to analyze the payload of a specific P2P application. As a result, the computation and storage overheads are small and the identification algorithm can perform quite well.

(3) Capability of identifying encrypted P2P traffic: The method is independent of the P2P application's payload; therefore, encryption of the data has no impact on the identification algorithm.

However, TLI also has many disadvantages. These include low accuracy, poor robustness and the lack of a traffic classification function.

(1) Low accuracy: Two factors result in the low accuracy of TLI. One is that many P2P traffic characteristics are not unique, and other applications may have the same characteristics; non-P2P network streams may therefore be treated as P2P traffic, causing errors in traffic statistics. The other factor is the complexity of the network environment. For example, the presence of asymmetric routes, packet loss and retransmission makes accurate identification of traffic characteristics difficult, thus affecting the accuracy of P2P traffic identification.

(2) Poor robustness: The TLI method cannot handle packet loss and reorganization, and it cannot adapt itself to complicated P2P applications.

(3) Lack of traffic classification function: TLI cannot classify P2P applications in detail because the transport layer traffic characteristics on which it is based often cannot indicate the type of application layer protocol. For P2P applications, detailed classification is quite important for implementing traffic monitoring measures, including node banning, traffic rate limitation, and QoS improvement.

DPI is now the most commonly used technology because it is easy to understand, upgrade and maintain. Its advantages include high accuracy, great robustness and classification capability. Because it adopts precise characteristic matching, DPI technology is highly accurate, with few errors in identifying P2P traffic. Being robust, the technology can handle packet loss and reorganization, so it adapts itself to complicated P2P applications. As for classification capability, the technology can precisely classify P2P applications based on their payload characteristics. The classification function of DPI can provide accurate information for P2P traffic monitoring.

The disadvantages of DPI include poor scalability, an inability to identify encrypted data and poor performance. In terms of scalability, the technology has a delay in identifying the traffic of a new P2P application: the traffic cannot be identified until the payload characteristics of the new application have been determined and the characteristics library has been updated. In the case of P2P payload encryption, the protocol and payload characteristics of P2P applications are hidden; therefore, DPI cannot effectively identify the encrypted traffic of P2P applications. The main reason for poor performance is that this technology requires operations such as protocol parsing and reconstruction and characteristic matching, which bring large computation and storage overheads and degrade the performance of the identification algorithm. The more complicated the payload characteristics are, the larger the identification cost and the poorer the performance.

Table 2 is a comparison of the two P2P traffic identification technologies, TLI and DPI, of which the TLI methods include TCP/UDP port number identification, network diameter analysis, node role analysis, protocol pair analysis and IP-port pair analysis.

5 Conclusions

TLI and DPI are the two most important P2P traffic identification technologies available at present. Because it offers high accuracy, great robustness and classification capability, and because most older P2P systems are not encrypted, DPI is the most widely used. However, DPI faces several challenges, such as improving the performance of identification algorithms, enabling the algorithms to support encrypted data analysis, and updating the characteristics library of P2P applications. Similarly, in spite of advantages such as good performance and scalability, TLI faces many problems in practical application due to its low accuracy. In addition, the existing methods are mainly intended for analyzing offline data, and they cannot identify P2P traffic in real time. Basically, TLI is a heuristic method while DPI is a precise matching method. By combining the advantages of the two methods, an accurate, efficient algorithm for P2P network traffic identification can be designed. Therefore, research on real-time identification algorithms based on heuristic deep packet inspection will be the main topic in the future.
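One possible shape of such a combination is sketched below. It is an illustrative design, not an algorithm proposed in this article: a precise payload match is tried first, and a transport-layer heuristic is used as a fallback for flows whose payload matches nothing (for example, because it is encrypted). The function `identify` and its two callable parameters are hypothetical.

```python
def identify(flow, dpi_classify, tli_suspects):
    """Combine a precise DPI match with a heuristic TLI fallback.

    flow: dict with at least a "payload" key (bytes).
    dpi_classify: callable(payload) -> application name or None.
    tli_suspects: callable(flow) -> True if transport-layer heuristics
                  suggest P2P behavior.
    """
    app = dpi_classify(flow["payload"])
    if app is not None:
        return app                    # precise match: classified by application
    if tli_suspects(flow):
        return "P2P (unclassified)"   # heuristic match only, e.g. encrypted
    return None


if __name__ == "__main__":
    dpi = lambda payload: "BitTorrent" if b"BitTorrent protocol" in payload else None
    tli = lambda flow: flow.get("dual_role", False)
    print(identify({"payload": b"\x13BitTorrent protocol"}, dpi, tli))
    print(identify({"payload": b"\x8f\x02\x11", "dual_role": True}, dpi, tli))
```

Running the heuristic first as a pre-filter would reduce DPI cost instead, at the price of missing flows the heuristic does not flag; which order is preferable depends on the deployment.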