Domain-Oriented Software Defined Computing Architecture

China Communications, June 2019

Ping Lv*, Qinrang Liu, Hongchang Chen, Ting Chen

National Digital Switching System Engineering and Technical R&D Center, Zhengzhou 450002, China

Abstract: With the introduction of software defined hardware by the DARPA Electronics Resurgence Initiative, software definition is becoming a basic attribute of information systems. Benefiting from the bounded scope and algorithm aggregation of domain applications, domain-oriented computing architecture has become a technical direction that balances the high flexibility and efficiency of information systems. Aiming at the characteristics of data-intensive computing in scenarios such as the Internet of Things (IoT), big data, and artificial intelligence (AI), this paper presents a domain-oriented software defined computing architecture, discusses its hierarchical interconnection structure, heterogeneous-granularity computing element, and computational kernel extraction method, and finally demonstrates the flexibility and efficiency of the architecture through experimental comparison.

Keywords: software defined hardware; software defined computing architecture; hierarchical interconnection; heterogeneous-granularity computing element

I. INTRODUCTION

The system architecture designed by John von Neumann in the 1940s (the von Neumann architecture) remains the mainstream computing architecture and has not undergone essential changes for decades. Its basic feature is a computing model of shared data and serial execution: programs and data are stored in a shared memory, and the processor fetches instructions and data from it to perform the corresponding calculations. That is, memory and processor are separated from each other, and communication between them goes through the system bus.

Historically, thanks to the continuous evolution of semiconductor technology following Moore's law, the throughput and performance of computing systems kept improving. However, as the semiconductor process approaches the silicon-based limit, Moore's law is slowing down, and relying solely on process progress to improve performance is hitting a bottleneck. Meanwhile, with the extensive application of information technologies such as IoT, big data, and AI, the data resources in the information infrastructure are growing explosively, and this mismatch between demand and capability is driving technological and industrial reform. Frequent data exchange in the von Neumann architecture wastes a large amount of power on the bus and restricts the speed of information processing, resulting in the famous "von Neumann bottleneck".

To exploit the explosive growth of data resources, information technology faces severe challenges in big data transmission, management, calculation, and analysis. To deal effectively with the data-intensive tasks of big data and the fragmented distributed computing systems of AI, academia and industry are turning to innovation in computing architecture, hoping to achieve tremendous improvements in computing performance and productivity that meet the flexible, efficient, and highly integrated processing requirements of the big data and AI era.

In 2016, Forbes listed the "new computer architecture" among the top five technologies that will affect the world in the next 15 years. In 2017, DARPA launched the Electronics Resurgence Initiative (ERI) [1] to meet the engineering and economic-cost challenges in the field of microelectronics. One of its three main supporting areas was architecture innovation, which included two programs: DSSoC (Domain-Specific System-on-Chip) and SDH (Software Defined Hardware). In 2018, at the International Symposium on Computer Architecture (ISCA 2018), Turing Award winners John Hennessy and David Patterson envisioned that computer architecture is ushering in a new golden age [2] and pointed out that domain-specific architecture is the future direction of development.

The rest of this paper is organized as follows. Section II proposes a domain-oriented software defined computing architecture. Section III compares it experimentally with a general-purpose server in a video copy detection application. Section IV discusses the future development of computing architecture.

II. DOMAIN-ORIENTED SOFTWARE DEFINED COMPUTING ARCHITECTURE

Computing architecture has evolved through single-core, multi-core concurrent, heterogeneous, and reconfigurable computing architectures [2-5]. To escape the limitation of sequential instruction-stream execution in the von Neumann architecture and meet the requirements of efficient computing for large-scale concurrent data streams in the IoT and big data era, especially for flexible and efficient data-intensive concurrent algorithms such as deep learning, this paper follows the idea of "structure adapting to application" and presents a software defined computing architecture covering the interconnection structure, the computing element, and the processing flow.

This section details the hierarchical interconnection structure, the heterogeneous-granularity computing element, and its computational kernel extraction method.

2.1 Hierarchical interconnection structure

Although the two-dimensional Mesh (2D-Mesh) based NoC is the common topology for multi-core CMPs, it has three problems when used in large-scale systems: first, NoC power consumption rises markedly as the number of connected units increases; second, network bandwidth is consumed quickly as concurrent tasks increase; finally, a fixed network topology neither supports the locality principle of the hierarchical data-driven execution model nor adapts to changes in the communication mode. An interconnection network supporting software definition therefore becomes very important: it can selectively shut down unused links to reduce power consumption, and it can change its topology dynamically with the demands of the communication mode.

An expandable 2D-Mesh is proposed in [6], where the router function is implemented by a lightweight ring switch. This simple micro-architecture ensures good scalability and low power consumption; however, its network topology is fixed and the utilization of some links is low. The integration of a reconfiguration function into the traditional router micro-architecture is presented in [7]. Despite its energy-saving advantage, the router's traditional micro-architecture, crossbar switch, and link drivers still consume most of the power. A software programming method for NoC topology is described in [8]. Its physical substrate is a 2D-Mesh network built on simple rings running along the X and Y dimensions of the grid, where ring stations can expand and forward traffic from one dimension to another; in addition, the instruction sets of the processing elements can be extended to control network configuration and schedule application threads. A hybrid software-hardware NoC architecture is described in [9]. Compared with a traditional NoC, its control plane runs as a software module on a dedicated core that makes dynamic decisions on network configuration. Based on current status and application requirements, this solution adapts easily to irregular traffic patterns, but its centralized components are a bottleneck for overall scalability. How to control packet processing and perform hardware-independent switching is proposed in [10]: the programmer first specifies the packet processing function, and the compiler then converts the program into a dependency graph table that can be mapped to multiple specific target switches to tell each switch how to process packets. A programmable, protocol-independent forwarding SDN based on a universal stream instruction set is presented in [11]. The main idea of this work is to eliminate the dependency on protocol-specific configuration of forwarding elements and to enhance data paths with new state instructions; this method can reduce network cost by using commodity forwarding units. An open packet processor is discussed in [12] to improve the interaction between platform-independent hardware configurability and packet-level programming flexibility. It aims to help programmers deploy stateful forwarding tasks, and demonstrates the feasibility of the extended finite state machine as the underlying data-plane programming abstraction for state operations and flow-level feature tracking.

Fig. 1. Hierarchical interconnection structure.

As shown in Figure 1, this paper proposes a hierarchical interconnection structure on the basis of 2D-Mesh, where local communication uses software defined interconnection (SDI) and global communication is based on a software defined 2D-Mesh.

This structure divides the computing elements into clusters. Four computing elements form a computing cluster, in which the elements are connected by the SDI [13] structure. The interconnection among the four computing elements requires high bandwidth, low latency, frequent communication, and various modes, and the characteristics of SDI naturally meet these requirements. In the SDI architecture shown in Figure 1, one port is connected to the global communication router, and the other five ports are connected to four computing elements and one storage element. According to the requirements of the communication mode, the bandwidth of each port can be flexibly defined in symmetric or asymmetric mode. The six ports can also be flexibly defined into ring, tree, or grid topologies to match the communication patterns of different algorithms. All ports support not only low-latency packet switching but also configuration-based circuit switching. Through this design, the four computing elements contain homogeneous or heterogeneous local caches, and multiple computing elements can be defined iteratively into a coarser-granularity computing element to meet the diverse granularity requirements of heterogeneous computing elements in high energy-efficiency computing applications.
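The configuration degrees of freedom described above can be summarized in a small model. The following Python sketch is purely illustrative (the class names, bandwidth figures, and field layout are assumptions, not the paper's interface): it captures a six-port SDI whose topology, per-port bandwidth, and switching mode are all software defined.

```python
# Illustrative model of a six-port SDI configuration (hypothetical names;
# the paper does not specify a configuration API).
from dataclasses import dataclass, field
from enum import Enum

class Topology(Enum):
    RING = "ring"
    TREE = "tree"
    GRID = "grid"

class Switching(Enum):
    PACKET = "packet"    # low-latency packet switching
    CIRCUIT = "circuit"  # configuration-based circuit switching

@dataclass
class Port:
    name: str
    bandwidth_gbps: float          # per-port bandwidth; may be asymmetric
    switching: Switching = Switching.PACKET

@dataclass
class SDIConfig:
    topology: Topology
    ports: list[Port] = field(default_factory=list)

    def total_bandwidth(self) -> float:
        return sum(p.bandwidth_gbps for p in self.ports)

# One cluster: four computing elements, one storage element, one router uplink.
cluster = SDIConfig(
    topology=Topology.RING,
    ports=[
        Port("CE0", 16.0), Port("CE1", 16.0),
        Port("CE2", 8.0),  Port("CE3", 8.0),   # asymmetric bandwidth split
        Port("MEM", 32.0, Switching.CIRCUIT),  # storage element
        Port("R",   16.0),                     # uplink to the global 2D-Mesh router
    ],
)
```

Redefining the cluster for a different algorithm amounts to constructing a new SDIConfig, which mirrors the paper's point that the topology and bandwidth allocation follow the communication mode rather than being fixed in hardware.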

Computing element clusters are connected through the software defined 2D-Mesh structure. Each router includes four ports connected to the software defined interconnections and four ports connecting to neighboring routers in the four directions (east, west, north, and south). Routers support packet-instruction-driven addressing and flow control, and unused links or computing element clusters can be disabled by configuration to control power consumption.

This hierarchical interconnection structure not only keeps the good connectivity of the 2D-Mesh structure, but also overcomes its limitation in large-scale expansion through local software defined interconnection. It also adapts the topology well to the local communication characteristics of data flows, and shows good flexibility, scalability, and power control.

2.2 Heterogeneous granularity computing element

In order to achieve high performance for domain-oriented applications, we adopt a special computing element design for domain application algorithms. At the same time, to retain the flexibility to support different scenarios of domain applications, we propose a heterogeneous-granularity computing element based on software definition. First, we analyze the algorithmic characteristics of domain applications; we then determine the efficient computing boundary and extract and construct the basic computing kernels, from which heterogeneous-granularity computing elements can be generated by recursive definition.

The computing core is the basic component of a domain application algorithm and can be obtained by decomposing the algorithm. A computing core set is a group of such arithmetic kernels; a complete application algorithm can be supported by integrating and combining them. The core of heterogeneous-granularity computing element development is solving for the computing core set, and the solution algorithm is as follows.

Assume:

The set of target applications supported by the domain is $L = \{l_1, l_2, \ldots, l_n\}$.

The weight of application $l_i$ in the target set is denoted $\alpha_i$, with $\sum_{i=1}^{n} \alpha_i = 1$.

The basic computing core set to be sought is $FF = \{FF_1, FF_2, \ldots, FF_k\}$, where $FF_i$ denotes the $i$-th basic computing core and $c_i$ denotes the number of instances of $FF_i$.

The available resources of the domain application computing chip are denoted $R$; the resource consumed by one instance of $FF_i$ is $r_i$, so the total resource constraint is $\sum_{i=1}^{k} c_i r_i \le R$.

The resources occupied by the software definition of application $l_i$ are denoted $r_{l_i}$, so the total software definition cost is $\sum_{i=1}^{n} r_{l_i}$.

For a computing core instance $FF_{i,j}$, where $j$ is the serial number of an instance of $FF_i$ and $1 \le j \le c_i$, its invocation by application $l_m$ is indicated by

$$u_{i,j}^{m} = \begin{cases} 1, & l_m \text{ calls } FF_{i,j}, \\ 0, & \text{otherwise}. \end{cases}$$

The utilization rate of computing core instance $FF_{i,j}$ is

$$P_{i,j} = \sum_{m=1}^{n} \alpha_m\, u_{i,j}^{m}.$$

A reasonable basic computing core set achieves a similar probability of use for every computing core while meeting the software-definition requirements of the domain applications. That is, the set $FF$ whose computing core utilization rates have the smallest variance is the solution:

$$FF^{*} = \arg\min_{FF}\ \mathrm{Var}\big(P_{i,j}\big).$$

The algorithm to obtain the basic computing core set is an iterative process: it starts from the simple union of the computing core sets of the domain applications and adjusts the set to minimize the variance of the computing core utilization rates. Figure 2 shows the flow chart of the heterogeneous-granularity computing core extraction algorithm.
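As an illustration of the extraction loop sketched in Figure 2, the following Python code starts from the union of the per-application core sets and greedily removes cores while the utilization-rate variance decreases. The data representation, the coverage test (a crude proxy for "meeting the software-definition requirements"), and the greedy adjustment step are all simplifying assumptions for illustration; the actual PDSS tool is not published in this form.

```python
# Sketch of the basic-computing-core extraction loop: begin with the simple
# union of per-application core sets, then adjust the set to minimize the
# variance of the utilization rates P_{i,j}.
from statistics import pvariance

def utilization(cores, apps, weights):
    """Utilization of each core: weighted fraction of applications calling it."""
    return [sum(w for a, w in zip(apps, weights) if c in a) for c in cores]

def extract_core_set(apps, weights, cost, budget, max_iters=100):
    # apps: list of sets of candidate cores used by each application l_i
    # weights: the alpha_i weights; cost: core -> resource r_i; budget: R
    best = sorted(set().union(*apps))            # start from the simple union
    best_var = pvariance(utilization(best, apps, weights))
    for _ in range(max_iters):
        improved = False
        for c in list(best):
            trial = [x for x in best if x != c]  # try removing one core
            if not trial or not all(a & set(trial) for a in apps):
                continue                         # every application must stay covered
            var = pvariance(utilization(trial, apps, weights))
            if var < best_var:
                best, best_var, improved = trial, var, True
        if not improved:
            break                                # variance can no longer be reduced
    assert sum(cost[x] for x in best) <= budget  # resource constraint R
    return best
```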

Taking the basic computing core as the basic element of software definition, we use a distributed design that decouples the computing units and storage units in the chip and integrates multiple computing/storage units according to resource requirements, so that on-demand scaling of computing core granularity is realized. A program uses the computing element cluster as the unit of mapping and execution, and the resource scale of the cluster does not affect the correct execution of the program; thus, the hardware resources are decoupled from the execution model.

With the support of the National Program Project,we have developed a Proactive Decision Support System (PDSS) tool platform,which can automatically analyze the algorithm and generate the optimal basic computing core set [14].

2.3 Data flow driven parallel execution

Fig. 2. Computing core grain extraction process.

In the data flow execution model, instructions follow the firing rule to determine when to execute: any instruction in the program can execute as soon as all of its operands are ready, without a program counter to monitor the execution flow. Compared with the traditional control-flow-driven execution model, the data-flow-driven execution model has three advantages.

First, the data-flow-driven mode determines the execution order of instructions by the program's data flow graph; there is no other restriction on execution order, so in theory, as long as hardware resources are sufficient, instruction-level parallelism can be exploited to the maximum extent. Second, the data-flow-driven model can fully exploit thread-level parallelism in the program. For example, the bodies of a loop can be unrolled simultaneously: the dependent parts form an asynchronous pipeline, while the independent parts can be treated as independent threads and executed in parallel. Third, the data-flow-driven model does not need complex control logic to maintain an instruction execution sequence; the instruction memory and the executable-instruction queue cache can adopt a distributed structure, which greatly reduces the complexity of the hardware design.
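A minimal simulator makes the firing rule concrete. In the sketch below (illustrative names, not the architecture's instruction format), an instruction fires as soon as all of its operands are available, and every ready instruction of a step can fire together.

```python
# Minimal dataflow "firing rule" simulator: an instruction executes as soon
# as all of its operands are ready; there is no program counter.
def run_dataflow(instrs, inputs):
    # instrs: dict name -> (op, [operand names]); inputs: dict name -> value
    values = dict(inputs)
    pending = dict(instrs)
    while pending:
        ready = [n for n, (_, deps) in pending.items()
                 if all(d in values for d in deps)]   # the firing rule
        if not ready:
            raise RuntimeError("deadlock: cyclic or missing dependency")
        for n in ready:                               # all ready nodes fire together
            op, deps = pending.pop(n)
            values[n] = op(*(values[d] for d in deps))
    return values

# Example: (a+b) * (a-b). The add and subtract have no mutual dependency
# and can fire concurrently; only the multiply must wait for both.
result = run_dataflow(
    {"s": (lambda x, y: x + y, ["a", "b"]),
     "d": (lambda x, y: x - y, ["a", "b"]),
     "p": (lambda x, y: x * y, ["s", "d"])},
    {"a": 7, "b": 3},
)
print(result["p"])  # 40
```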

The two-level stream programming model, based on a flow level and a core level, can fully realize parallelism at different levels. This model decomposes the application into a series of computing elements, and data is computed online as it streams through them, forming a complete data flow graph. The flow-level programming model is responsible for organizing data, calling core-level programs, and handling the complex control-flow parts of the program, using software pipelining to mine thread-level parallelism. The core-level programming model mainly describes the compute-intensive parts of the program and can be designed in SIMD or short-vector form according to the hardware structure, mining instruction-level and data-level parallelism. The two-level programming model decouples the computing and communication parts of the application and does not depend on the specific hardware structure, which makes distributed execution of programs easier.
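The division of labor between the two levels can be sketched as follows, using Python with NumPy as a stand-in: the flow level tiles and pipelines the data stream, while the core level is a vectorized (SIMD-style) kernel. The function names, the kernel, and the tile size are illustrative assumptions.

```python
# Two-level split: flow level organizes and pipelines data; core level is a
# compute-intensive, data-parallel kernel (vectorized here for illustration).
import numpy as np

def core_kernel(tile: np.ndarray) -> np.ndarray:
    # Core level: SIMD/short-vector friendly, no control flow.
    return np.sqrt(tile * tile + 1.0)

def flow_level(stream: np.ndarray, tile_size: int = 1024):
    # Flow level: cuts the stream into tiles and feeds them to the kernel,
    # forming a software pipeline across tiles.
    for i in range(0, len(stream), tile_size):
        yield core_kernel(stream[i:i + tile_size])

out = np.concatenate(list(flow_level(np.arange(4096, dtype=np.float64))))
```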

To avoid changing the programming model of the application, we use a class-data-flow-driven execution model that combines the advantages of control flow and data flow. In this execution model, the data flow graph is used as the intermediate representation: the compiler organizes the program into instruction blocks according to the data flow graph; the partial-order relationship of the control-flow execution model is maintained among instruction blocks, while within a block the data-flow-driven model is used and instructions run in parallel. By combining data flow and control flow in this way, the functions of software and hardware can be divided more reasonably, and the capabilities of both can be fully utilized to find parallel executable instructions and improve program efficiency.

Figure 3 shows the basic process by which a specific program is divided and mapped onto the class-data-flow-driven execution model. First, the compiler statically analyzes the execution phases of the entire program and converts the source program into a sequence of instruction blocks with the partial-order relation of the control flow; it then constructs the data flow graph of each block and transforms the instruction sequence within the block into data-flow relationships according to that graph.

As the figure shows, although the original control flow of the program has not been completely eliminated, it exists only at the block level; there is no control dependency between the instructions within a block, so instructions can be issued and executed out of order following the data flow.

2.4 Domain-oriented software defined computing architecture

Based on the hierarchical interconnection structure, heterogeneous-granularity computing elements, and data-flow-driven execution model described in the previous three sections, Figure 4 presents the domain-oriented software defined computing architecture proposed in this paper.

Unlike general-purpose processors, which repeatedly fetch instructions, count, compute, and store intermediate results, the workflow of the domain-oriented software defined computing architecture is: software-define the computing elements (clusters), perform initialization configuration, process online, and store results.

The computing architecture includes computing elements (CE), local memories (Mem), software defined interconnections (SDI), reconfigurable routers (R), and external interface units (IOU). A computing element can accomplish a task independently or cooperate with other elements in a computing cluster. Following the data-stream principle of "online processing and path-by-channel caching", each computing element contains a cache for basic requirements, and each computing cluster shares a local memory to support fast processing and local caching of highly concurrent data flows. Different computing elements (clusters) are connected by network-reconfigurable routers that handle global communication and support data flow routing and forwarding according to packet instructions. Each edge router has a dedicated external interface unit, which can be instantiated as a high-speed interface or an external cache interface to achieve high-speed, large-capacity data throughput and off-chip storage.

Fig. 3. Mapping of a serial program onto the data-flow-driven execution model.

Based on the software defined computing architecture, programmers can explicitly decompose computing tasks and data, assigning different tasks to different computing elements for parallel execution to fully exploit data-level parallelism. Computing elements can be defined by software as homogeneous or heterogeneous according to the task types of the application scenario, and can flexibly form computing clusters of different scales to minimize communication overhead, especially long-distance communication overhead. To make applications convenient to develop on the software defined computing architecture, OpenCL or domain-specific programming languages such as P4 and Scala [15-17] can be used to manage computing and interconnection resources explicitly, optimizing computing and interconnection behavior online according to application requirements to achieve the best dynamic energy-efficiency ratio.
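As a rough host-side illustration of such explicit decomposition (a hypothetical sketch; the paper names OpenCL, P4, and Scala but fixes no concrete interface), the code below splits the data across a fixed number of worker "elements" and runs them in parallel.

```python
# Hedged sketch of explicit data decomposition across computing elements;
# threads stand in for software-defined computing elements.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def element_task(chunk: np.ndarray) -> float:
    # Stand-in for the per-element computation mapped onto one CE.
    return float(chunk.sum())

def run_on_elements(data: np.ndarray, n_elements: int = 4) -> float:
    chunks = np.array_split(data, n_elements)   # explicit data decomposition
    with ThreadPoolExecutor(max_workers=n_elements) as pool:
        return sum(pool.map(element_task, chunks))

print(run_on_elements(np.arange(1_000_000)))
```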

III. EXPERIMENTAL EVALUATION

To verify the flexibility and energy efficiency of the software defined computing architecture, we developed a video copy detection verification platform using this architecture. The platform is part of the PDSS tool developed under the National Program Project. Four XILINX XC7VX1140T FPGAs serve as software definition processing nodes and are fully interconnected through GTX and LVDS interfaces. Each FPGA is connected to three 16 GB DDR memories. Each platform has four 10GE data interfaces, four customer optical interfaces, and four GE control interfaces. The system implementation structure is shown in Figure 5.

Fig. 4. Domain-oriented software defined computing architecture.

The video copy detection algorithm [18-20] mainly includes two processes: feature extraction and feature detection (similarity calculation). Considering the temporal correlation of video frames, we combine a weak single-frame feature with continuous multi-frame detection. Feature extraction is based on simple global features of the gray-scale image, and each frame feature is represented as an N-dimensional binary vector (N can be 32, 64, or 128); this calculation is simple and is not a computing bottleneck. Similarity calculation exhaustively computes the distance between a single frame feature and those in the feature library, and then makes an aggregation judgement over continuous multi-frame windows, where the aggregation degree can be 16, 32, or 64 frames. Similarity calculation accounts for the vast majority of the computation of the whole process and is therefore the bottleneck. Targeting this bottleneck, we implemented the matching computation on the verification platform described above, using the computing kernel extraction algorithm and the software defined architecture proposed in this paper, and compared the performance and effectiveness against an implementation on a general-purpose server. According to the 32/64/128-bit calculation precision and the aggregation unit of 16/32/64 frames, there are 9 application modes, as shown in Table 1.
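The similarity bottleneck can be expressed compactly. The following Python sketch is a simplified software stand-in for the hardware pipeline (the threshold, library layout, and function names are assumptions): it computes exhaustive Hamming distances of each frame feature against the library and then aggregates the best matches over a sliding multi-frame window.

```python
# Exhaustive Hamming-distance matching of N-bit binary frame features
# against a feature library, followed by multi-frame aggregation.
# Parameters mirror the paper's modes: N in {32, 64, 128}, window in {16, 32, 64}.
import numpy as np

def hamming(query: np.uint64, library: np.ndarray) -> np.ndarray:
    # Popcount of XOR gives the Hamming distance between binary features.
    xor = library ^ np.uint64(query)
    return np.array([bin(int(x)).count("1") for x in xor])

def detect(frames, library, n_bits=64, window=16, thresh=0.2):
    # Best single-frame match per frame, then aggregation over a sliding
    # window of `window` frames (the threshold value is an assumption).
    best = [hamming(f, library).min() / n_bits for f in frames]
    return [float(np.mean(best[i:i + window])) < thresh
            for i in range(len(best) - window + 1)]

# Tiny usage example with random 64-bit features.
rng = np.random.default_rng(0)
lib = rng.integers(0, 2**63, size=1000, dtype=np.uint64)
frames = rng.integers(0, 2**63, size=32, dtype=np.uint64)
print(detect(frames, lib))
```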

For each application mode, the compiler can dynamically map and generate different computing structures according to the specific mode and load conditions. Figure 6 shows the hardware circuit mapping and implementation structure when the aggregation unit is 16, 32, or 64 frames.

In the figure, D16/D32/D64 represent the Hamming distance calculation at different precisions and the similarity judgment units for the different aggregation degrees in the 16/32/64-frame aggregation modes; they are also the most computation-dense parts. C16/C32/C64 represent the result aggregation computing units in the 16/32/64-frame aggregation modes, corresponding one-to-one with D16/D32/D64. LIBi (i = 1, 2, 3, 4) represents a video feature fragment. In the 16/32/64-frame aggregation modes, each LIB needs one D16+C16, two D32+C32, or four D64+C64 units respectively to sustain real-time pipelined calculation. The final aggregation results are sent to M16/M32/M64, which select partial results as the output.

The above solution is compared with one built on an IBM X3850 X6 general-purpose server. The precision and computing efficiency of the two implementations are tested respectively: precision is expressed by accuracy and recall rate, while computing efficiency is expressed by calculation delay and power consumption. Precision is tested by comparing the original videos in the video library with versions subjected to six changes (landmark, caption, resolution, brightness, contrast, and aspect ratio) and checking the similarity. Computing efficiency is measured by using feature libraries of more than 15 million frames to simulate the real load curve for 1-2 hours, and the nine modes mentioned above are compared. The experimental results show that the accuracy and recall rate of the two architectures are the same. Compared with the single general-purpose server, the software defined computing architecture improves performance by 211 to 454 times and efficiency by 155 to 323 times, as shown in Figure 7.

Fig. 5. Architecture of the video copy detection system.

Table I. Nine application modes (32/64/128-bit precision combined with 16/32/64-frame aggregation).

Fig. 6. Software defined computing structure: mapping, layout, and implementation of three application modes.

IV. DEVELOPMENT PROSPECT

With the emergence of new applications, demands on computing capability keep growing, from image and speech recognition to driverless cars to beating the top Go players. We are experiencing an era of explosive growth in demand for computing on and understanding of visual data, and interconnection is developing quickly from people-to-people, to people-to-things, and to ever more versatile things-to-things. To meet the demand of concurrent online processing of large data, we must escape the bottleneck of the traditional von Neumann architecture and create a highly efficient computing architecture that can be applied in different fields with sufficient flexibility. This paper presented the domain-oriented concept and proposed a software defined computing architecture based on extracting basic computing elements, and verified its flexibility and efficiency by experiments. In-depth research on this architecture will be conducted in different fields, covering the implementation of the computing element structure, the software defined interconnection, and the reconfigurable network routers, as well as the packet-driven computing-communication model and the software-hardware co-programming model; all of these are follow-up research directions.

Fig. 7. Performance and efficiency comparison between the software defined computing architecture and a common server.