方木云,金 瑜,吴辉冬,仇 祯,刘海波,赵鲜鲜
(安徽工业大学计算机科学与技术学院,安徽马鞍山243032)
Communication latency and physical cable length are always the major concerns for the large parallel applications deployed on High Performance Computing(HPC)systems,which could reach hundreds of nanoseconds[1-4],and thousands of kilometers[5].Thus, there are strong desire and fundamental requirement for designing a system with low latency and short cable length. Switch delays (about 100 nanoseconds in InfiniBand QDR) are larger than the wire and flit injection delays even including serial and parallel converters(SerDes)[6].Therefore,the problem of developing low latency and short cable length system could be converted into designing a topology of switches having low diameter,average shortest path length and total cable length.Thanks to the high-radix switches with dozens of ports[7],it is possible to design low-latency topologies by using multiple links per switch nowadays.Unfortunately, more links will increase cable length, so layout-conscious technology should be always taken into account for HPC oあ-chip interconnects[8-9].
Various topologies of high-radix switches have been adopted in HPC systems. Specifically,there are tori,meshes and hypercubes for directed topology design,and Fat trees,Clos network,Omega and flattened butterfly for undirected topology design[1].These topologies are either regular or irregular depending on that whether their switches have the same steps and degrees or not. They have their own advantages and drawbacks in terms of latencies,cable length, packaging complexity, and fault tolerance. Which kind of topology is more suitable for a HPC system is hard to tell, which is based on its scale, electrical power budget, surface areas, and cost. HPC systems maintain a particular trade-oあamong these attributes.The diversity of topologies provides a wide range of choices for the design of HPC on-chip and oあ-chip interconnects.
Undirected topologies in which some switches only link other switches for facilitating communication are widely adopted in oあ-chip HPC systems.However,directed topologies due to its space limitation so that no more switches could be embedded in are usually adopted in most on-chip HPC systems.
Loop topologies,as an important parallel computing pattern and potential choice for HPC systems,have been studied over several decades. Fixed-step loop topology was firstly introduced in [10], such as double-loop networks[11-13]and multi-loop networks[14-15]. Random-step loop topology was recently addressed in [1-4, 8-9].Michihiro Koibuchi et al. proposed the use of random shortcuts and found out that it could drastically reduce the diameter and average shortest path length comparing with the non-random shortcuts scheme(i.e.,a kind of fixedstep loop topologies).There are two kinds of random methods:the first one forms a ring ahead and then randomly adds shortcuts, the second one randomly adds links and thus no ring formed ahead. In this paper we use randomshortcut and random-step to denote them respectively.
Random-step loop topology can dramatically reduce latencies, but it leads to complex packaging and long physical cables which impact the cost negatively.In this paper,we propose a novel exponential-step loop topology for HPC interconnections, which not only further reduces the latency of the network by up to 30%, but also decreases the packaging complexity and physical cable length. Furthermore, contrary to the fixed-step and random-step schemes that calculate communication latencies after generating the topologies,the proposed scheme could estimate the latencies before the topology generation.
According to the number of steps,fixed-step loop topologies can be classified into double loop networks and multi-loop networks.
1) Double Loop Topologies: Double loop topologies have been widely studied in the past years due to their importance to the design of interconnection or communication computer networks. For a survey about these networks see, for example, the papers of Chen et al.[11,16-17],Aguilo et al.[12]or Hwang[13,18-19].The directed graphs that model such networks are usually called double fixed step or circulate directed graphs.A double fixed-step directed graphs with n vertices, denoted by G(N; s1; s2) , is a digraph with set of vertices V, the set of integers modulo n,and adjacencies given by the rules i →i+s1(mod N);i →i+s2(mod N),with i;s1;s2 ∈V.The integer s1 and s2 are called the steps of the directed graph.The condition gcd(N;s1;s2)=1 must hold for the directed graph to be strongly connected.
One of the most important parameters in these directed graphs is the diameter,which represents the transmission latency of the corresponding network.Other interesting parameters include the mean distance,fault tolerance,and connectivity.
Wong and Coppersmith[10]gave the lower bound of the diameter and mean distance for the directed double loop networks as
Most work focus on the minimization of the diameter for any given number of vertices.
2)Multi-loop Topologies:Multi-loop topologies were firstly proposed by Wong and Coppersmith[10]for organizing multi-module memory services.Foil et al.[12]slightly extended its definition in their study of the data alignment problem in SIMD processors. Multi-loop topologies are also widely used in local area computer networks[15,20-21]and large area communication networks like SONET[22-24].
A multi-loop network, denoted by L(N; s1; s2;…; sl), can be represented by a digraph on N nodes, 0;1;…;N-1.and l links of types:i →i+s1;i →i+s2;…;i →i+sl(mod N);i=0;1;…;N-1:When l is specified,we can also call it a directed l-loop network.A symmetric 2l-loop network,L(N;s1;…;sl;-sl;…;-sl)can be represented by a graph on N vertices with edges of l types:(i;i+s1);(i;i+s2);…;(i;i+sl)(mod N)for i=0;1;…s;N-1:
Wong and Coppersmith[10]also gave the lower bound of the diameter and mean distance for the directed and undirected multi-loop networks,i.e.,
Random-step loop topologies are investigated in [1-4,8-9,25], which have better performance comparing with the fixed-loop ones in terms of low diameter,low average shortest path length and high default tolerance.
There are two methods to add random shortcuts to a base ring in[1-4].The first one is that for a given ring with N vertices, consider a procedure by which shortcuts are added randomly to each vertices i , i.e., by picking each destination vertex uniformly among the possible 2n-1 vertices, while assuring that all vertices have the same degree so as to allow a comparison between same-degree topologies.The second one is a more sophisticated,but also more costly, shortcut generation method. When adding y shortcuts to a vertex, k×y random feasible destination vertices are found with k≥1. The length (in number of hops) of the shortest path to each of these vertices is computed.The y vertices with the longest and shortest paths are selected as the destination of shortcuts.The goal is to pick shortcuts judiciously, i.e., to avoid adding shortcuts between vertices that already have a shortest path between them.
In this subsection, we briefly introduce some topologies that are generally used for HPC interconnections with high-radix switches. In directed topologies, each switch connects directly to a number of compute nodes as well as to other switches. Popular directed topologies include k-ary n-cubes, with a degree of 2n which lead to tori, meshes, and hyper-cubes. Each topology leads to a particular trade-off between degree and diameter. These topologies are regular, meaning that all switches have the same degree as each switch is connected to the same number of switches.
Undirected topologies,i.e.,topologies in which some switches are connected only to the neighboring switches, have also been proposed. They have low diameter at the expense of larger numbers of switches compared to directed topologies.The best known undirected topologies are Fat trees[25],Clos network and related multi-stage interconnection networks such as the Omega and Butterfly networks.A popular option is (p; q; r) Fat trees, with a degree of p+q,where p represents for the number of up-ward connections,q represents for the number of downward connections,and r represents for the number of tree levels.
More recently, variations of these topologies, such as the flatten butterfly[26], have been proposed as a way to improve cost eあectiveness.In fact,for large-scale HPC systems,the network layout has a high impact on network cost,especially because longer wires must be optical if high bandwidth is required.The Dragonfly network uses a group of routers as a virtual router to reduce the wire length in the context of high-radix topologies,which is down by distinguishing inter-cabinet and intra-group networks[27].Various topologies,including exponential-step topologies,can be used for on-chip or oあ-chip interconnects.
We propose the use of exponential-step in directed loop networks, which not only can furthermore reduce communication latencies but also maintain the logical relationship among vertices when compared to its random counterpart.
In this subsection,we present a method to generate exponential-step loop topology.Assume there are N vertices noted by 1,2,…,N-1,N and degree K.The method to generate directed topologies by using exponential-step can be described as followings:
Algorithm(I)aims at linking all the vertices as quickly as possible by using breadth-first method.Algorithm(I) achieves smaller communication latencies than random step because there exist no redundant links. Each link contributes to reducing delay.The linking process is just a routing method.
Algorithm (II) is used to complement the degrees. These link relationships are not as obvious as those produced in Algorithm(I),but still more easily deduced than random-step.
Recall the strategy of exponential step topologies: first routing and then linking. The strategy of fixed-step topologies and random-step topologies:first linking and then routing.
To facilitate better understanding, we take an example in Fig. 1 to illustrate the proposed exponential-step scheme, where N=23 and K=2. In Fig. 1(a), node 1 links to nodes 2, 3. In Figs 1(b), nodes 2 and 3 links to nodes 4,5 and 6,7 respectively.In Fig.1(c),node 4 links to nodes 8,1 and the Algorithm(I)ends.Fig.1(d)shows the process of supplementing degrees: Node 5 links to nodes 6, 7. Node 6 links to nodes 8,1. Node 7 links to nodes 2,3.Node 8 links to nodes 4,5.Algorithm(II)ends.
Fig.1 The generating process using exponential-step for N=23 and K=2
We develop a simulation platform using Microsoft Visual Studio.C# to study HPC topologies.Fig.2 created by our simulation platform shows four topologies generated by algorithm with N= 26 and K = 2, 3, 4, 5, respectively. Unlike the asymmetric topologies generated by random-step scheme and the fully symmetric topologies generated by fixed-step, the topologies generated by the proposed scheme are partially symmetric topologies.Specifically, they are symmetric from global vision and asymmetric from local vision. If K = m, then there are m sparsely linked areas and m densely linked areas.The densely linked areas can be aggregated into nearly cabinets to reduce physical cable length as discussed in[8].
Fig.2 Loop topologies using exponential-step with N=26 and K=2,3,4,5
Actually, the generated topology can be regarded as a kind of tree instead of loop. For the topologies with degree k=2,3,4,…,N,they could be regarded as binary,ternary,quaternary…n-ary tree.Here,we take an example of a topology with N=8 and K=2,3,4 respectively.Fig.3 denotes the tree generated by Algorithm(I).They are binary, ternary, quaternary trees respectively.Algorithm (II) shows the process of supplementing degrees. In this topology,the shortest path from one node to arbitrary another node all forms a tree.
Fig.3 Tree topologies using exponential-step with N=23 and K=2,3,4
For a topology with vertices N,there exist N trees.We find that the heights of these trees equal to either logkNor logkN+1,where k is the degree.The maximum height is just the diameter of the topology.Finally,our topology is generated by overlying N trees which denote N shortest path for N vertices.
In this section, by the aid of our simulation platform, we compare the proposed exponential-step topologies with random-shortcut, random-step and tight optimal fixed-step topologies in terms of communication latency,physical cable length and scalability.Note that except for random-shortcut,we also choose random-step topology because it has not been taken into account in [1-4]. Tight optimal fixed-step, rather than non-random shortcut topology which has a relatively large diameter referred in [1-4], is chosen to be made comparisons with other three kinds of topologies because it reaches the lower diameter bound among the fixed-step topologies.
Communication latency is the most important property for topologies.In this section,we will make comparisons among the above four types of topologies in terms of diameter and average shortest path length in two cases.
1)Degree is constant while N is variable
Firstly we let degree K be a constant while vertices N be variable so as to view how communication latencies change with the increment of vertices.Assume K=2 and N=25, 26, 28, 210, 211respectively.Their diameters/average shortest path lengths are shown in Tab.1,and their change curves are shown in Fig.4.
Tab.1 Diameter and mean distance vs.network size
Fig.4 Diameter and average shortest path length vs.node size N for four kinds of topologies with K=2
Note that,in Tab.1“8=5/33”means that 8 is the diameter and 5:33 is the mean distance.Seen from Tab.1 and Fig. 4, for the four kinds of topologies, their diameters and average shortest path lengths increase with the increment of node N. The proposed exponential-step topologies have the smallest latencies, which is further smaller than those of two kinds of random topologies. Compared to the tight optimal fixed-step topologies, it can reduce diameter by up to 85 percent. For instance, seen from the sixth line in Tab.1, reduction-bou may be up to 85%when K=2 and N=211.Compared to the two random topologies,it can reduce diameter by up to 30 percent or so.For instance,seen from the seventh line in Tab.1,reduction-ran maybe up to 33%when K=2 and N=26.Finally we find that random-shortcut and random-step topologies have the same diameters and nearly the same average shortest path lengths. Tight optimal fixed-step topologies, though reaches the lower diameter bound among the fixed-step topologies,have much larger diameters and average shortest path lengths than those of our proposed exponential-step and random counterparts.
2)N is constant while degree is variable
Secondly we let vertices N be a constant while degree K be variable so as to view how their communication latencies change with the increment of degree.Assume N=211and K=2,3,4,5,10 respectively.Their diameters/average shortest path lengths are shown in Tab.2,and their change curves are shown in Fig.5.
Tab.2 Diameter and mean distance vs.degree
Fig.5 Diameter and average shortest path length vs.degree K for four kinds of topologies with N=2 048
From Tab. 2 and Fig. 5, with the increment of K both the diameter and mean distance decrease for the four kinds of topologies. Other observations are similar to those of the above experiment. For instance, the proposed exponential-step topologies have the smallest latencies, which is further smaller than those of two kinds of random topologies. The diameter reduction-bou may be up to 85 percent when compared to the lower diameter bound of fixed-step topologies. Meanwhile the diameter reduction-ran may be up to 30 percent or so when compared to the two random-step topologies.
Two practical concerns when deploying an HPC interconnects are packaging complexity and physical cable length which impact layout and costs respectively[1-4].
Here we can’t give a quantitatively analysis for the pack-aging complexity and physical cable length,but we can give a qualitative analysis.Firstly,unlike the two random topologies which are asymmetric and possess chaos link relationship,the proposed exponential-step topology is symmetric and the node link relationships are certainly.So it has smaller packaging complexity.Secondly,as we known from the former section,there exist k denselylinked areas and k sparsely-linked areas in our proposed exponential-step topologies.The densely-linked areas can be aggregated into nearly cabinets to reduce physical cable length.Thirdly, the proposed exponential-step topologies can be laid out as a tree, so it is easy to be packaged. Its cable length can be smaller and packaging can be easier than those of the two random topologies.
Scalability is the ability of a network to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth[1].In one meaning,it can refer to the capability of a network to accept new vertices while ensure the original links can be modified as little as possible.When a new node must be added to an already existed layout, how many links have to be adjusted is a practical problem. Of course, the more less link-adjusted the more cheap fee-charged. For fixed-step and random-step topologies, when a vertex is added anywhere into the topology, only those links, which are linked to this new vertex, must be adjusted. But random topologies are more easily adjusted than fixed-step topologies. For the proposed exponential-step topologies, the price to add a new vertex is expensive. When the vertex is added to the start position of the topology, all links have to be adjusted.When the vertex is added to the middle position of the topology,all subsequent links have to be adjusted. When the vertex is added to the end position of the topology, the adjusted links is small. In a word,from this point of view the scalability of the proposed exponential-step topologies is the worst among the four kinds of topologies. On the other hand, scalability can also refer to the ability of a network to accept new nodes while maintain its diameter unchanged. Seen from Fig. 4, when N increases the diameters and mean distance of exponential-step topologies increase the most slowly.For a given diameter,the proposed exponential-step topologies can accommodate much large scope of vertices.So its scalability is the most best among the four kinds of topologies from this point of view.
1) In order to achieve lower communication latencies, as well as simpler packaging and shorter physical cable length, we have proposed the use of exponential-step in designing topologies, which take latencies into account ahead of the generation.
2)Simulation results show that it can further reduce diameter by up to 30 percent or so when compared with random counterparts, and maybe reduce by up to 85 percent when compared with the lower diameter bound of fixed-step counterparts. If its degree equals to k, the proposed directed exponential-step loop topology have k densely-linked areas and k sparsely-linked areas that can reduce packaging complexity and physical cable length by gather densely-linked nodes into the same cabinet or nearly cabinets.
3) In additional to being viewed as loop topology, the proposed topology can also be viewed as tree topologies in which there exist N overlying trees whose degrees are the same.