(School of Engineering and Technology, China University of Geosciences (Beijing), Beijing 100083, China)
Abstract This paper presents a parallel Graham scan algorithm for calculating two-dimensional (2D) convex hulls of scattered points on Graphics Processing Units (GPUs). The proposed parallel algorithm consists of two stages: (1) two rounds of preprocessing performed on the GPU and (2) the finalization of finding the convex hull on the CPU. We first discard the interior points located inside a quadrilateral formed by four extreme points, sort the remaining points according to their angles, and divide them into the left and the right regions. For each region, we perform a second round of filtering using the proposed preprocessing approach to further discard interior points. We finally obtain the desired convex hull by calculating the convex hull of the remaining points. We heavily exploit the parallel sorting, reduction, and partitioning primitives provided by the library Thrust for better efficiency and simplicity. Comparative results indicate that our parallel algorithm can calculate the convex hull of 20M scattered points in less than 0.5 seconds on a personal computer, and achieves a speedup of 6x~7x over the state-of-the-art baseline, i.e., the well-known QuickHull algorithm. Although our algorithm is not as fast as some reported GPU-based QuickHull implementations, it is very simple and easy to implement compared with related work. It could be an alternative in practice.
Key words convex hull; Graham scan; divide-and-conquer; parallel algorithm; GPU
In geometric modeling, calculating the two-dimensional convex hull of a set of planar points S means finding the smallest convex polygon that contains all the points in S. Calculating convex hulls is one of the essential issues in many fields, including computer science, robotics, and GIS. Several classic convex hull algorithms have been proposed[1-7]; most of them run in O(n log n) time. Among them, the Graham scan algorithm[1] is the first practical convex hull algorithm, while the QuickHull algorithm[7] is the most efficient and popular one in practice.
Recently, because the computational capability of Graphics Processing Units (GPUs) has surpassed that of Central Processing Units (CPUs), GPUs are being widely used in various applications. Some research has also focused on accelerating the calculation of convex hulls by utilizing the GPU. For example, Srikanth, et al.[8] used NVIDIA GPUs and Cell BE hardware to accelerate the calculation of 2D convex hulls based on the QuickHull approach[7]. Similarly, Srungarapu, et al.[9] and Jurkiewicz, et al.[10] developed parallel QuickHull algorithms to accelerate the calculation of 2D convex hulls. Moreover, Mei, et al.[11] developed the parallel CudaChain algorithm for finding 2D convex hulls on the GPU.
Also adopting QuickHull, Stein, et al.[12] proposed a new parallel algorithm for computing the convex hull of a set of points in 3D. Tang, et al.[13] presented a CPU-GPU hybrid algorithm to compute the convex hull of points in multidimensional spaces. Tzeng and Owens[14] developed a framework for accelerating the computation of convex hulls in the divide-and-conquer fashion using QuickHull. Based on Chan's 3D convex hull algorithm[15], a pure GPU-accelerated divide-and-conquer parallel algorithm was developed by White and Wortman[16] to compute 3D convex hulls. Furthermore, Gao, et al.[17] introduced a two-phase 3D convex hull algorithm utilizing the Voronoi diagram. Moreover, Gao, et al.[18] developed the algorithm ffHull, which allows vertices to be inserted before flipping edges.
In this paper, we develop a parallel version of the classic Graham scan algorithm for calculating two-dimensional (2D) convex hulls of scattered points on the GPU. There are typically two stages in the Graham scan algorithm: the first stage sorts the points according to their angles, and the second loops over all the sorted points in sequence to determine the extreme points. The number of points to be sorted thus determines the efficiency bottleneck of the Graham scan: the more interior points that can be discarded before sorting, the higher the computational efficiency. The most straightforward strategy in 2D is to create a convex quadrilateral from the four extreme points with min or max x or y coordinates, and then check each point to determine whether it is located inside this quadrilateral[4]. Several recent efforts for efficiently discarding interior points were introduced in [19-26].
We employ the above straightforward strategy[4] to improve the efficiency of our algorithm. In addition to the simple preprocessing procedure introduced in [4], we also propose a novel preprocessing approach that is well suited to the Graham scan. In our algorithm, we make the following two efforts: (1) we apply two effective preprocessing procedures that discard interior points to reduce the number of points that need to be sorted; and (2) we sort the remaining points extremely fast using an efficient parallel sorting algorithm.
Similar to the traditional Graham scan, our GPU-accelerated parallel Graham scan algorithm is also composed of two stages. The first stage, performed on the GPU, mainly includes two rounds of preprocessing and a fast sorting of points. The second stage, carried out on the CPU, finalizes the convex hull. Our algorithm achieves both efficiency and simplicity by heavily exploiting the library Thrust[24].
The rest of this paper is organized as follows. Section 1 introduces the outline and several basic ideas of our algorithm. Section 2 provides some implementation details. Section 3 presents the experimental results, which are then discussed in Section 4.
The GPU-accelerated parallel convex hull algorithm is designed on the basis of the conventional Graham scan. There are typically two stages in the conventional Graham scan algorithm: the first stage is to sort points according to their angles; and the second is to loop over all the sorted points in sequence to determine extreme points. The efficiency bottleneck of the Graham scan is the sorting of points according to their angles.
It is quite time-consuming to search for the convex hull of a large set of points, and thus the overall procedure of finding the convex hull can be very slow. To handle this problem and improve efficiency, we make the following two efforts on the GPU to speed up the first stage: (1) we apply effective preprocessing procedures that discard interior points to reduce the number of points that need to be sorted; and (2) we sort the remaining points extremely fast using an efficient GPU-based parallel sorting algorithm.
Similar to the conventional Graham scan, our algorithm is also composed of two stages. The first stage is conducted on the GPU which consists of two preprocessing procedures and a fast sorting of points. The second stage is the finalization of calculating the convex hull, which is conducted on the CPU. More specifically, the process of the developed algorithm is presented as follows:
① Search for the four extreme points with the max or min x or y coordinates using parallel reduction, and denote them as Pminx, Pmaxx, Pminy, and Pmaxy.
② Determine the distribution of all points in parallel and discard the points located inside the convex quadrilateral formed by Pminx, Pminy, Pmaxx, and Pmaxy using parallel partitioning.
③ Calculate the distance and angle with respect to the lowest point, i.e., Pminy, for each of the remaining points.
④ Sort all points in the ascending order of angles using parallel sorting.
⑤ Find the point with the longest distance using parallel reduction, and denote it as Plongest.
⑥ Divide the list of sorted points into the left and the right regions at the point Plongest; see Figure 1(a).
⑦ Perform the proposed preprocessing approach on both the left and the right regions to further discard interior points.
⑧ Find the convex hull of the remaining points.
Fig. 1 The preprocessing procedure for checking and discarding interior points
First, the interior points of the quadrilateral are discarded in the first round of preprocessing (i.e., Steps 1 and 2); then the remaining points are sorted after calculating the distance and angle of each point (Steps 3 and 4). In Steps 5~7, the interior points among the presorted points are further removed through the second round of preprocessing. These seven steps are fully conducted on the GPU. Step 8 finds the convex hull of the remaining points and is performed on the CPU. This step is the same as finding the convex hull of a series of pre-sorted points in the sequential Graham scan.
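Step 8, the sequential finalization on the CPU, is the standard Graham scan over angle-sorted points. The following is a minimal C++ sketch of this step (our illustration, not the authors' exact code); it assumes the pivot point is stored first and the remaining points are already sorted by angle around it:

```cpp
#include <vector>

struct Pt { double x, y; };

// Cross product of (b - a) x (c - a): > 0 means c is left of the directed line a->b.
static double cross(const Pt& a, const Pt& b, const Pt& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Sequential Graham scan over points pre-sorted by angle around the pivot,
// which is assumed to be pts[0]. Returns the hull in counter-clockwise order.
std::vector<Pt> grahamScan(const std::vector<Pt>& pts) {
    std::vector<Pt> hull;
    for (const Pt& p : pts) {
        // Pop points that would create a clockwise (right) turn.
        while (hull.size() >= 2 &&
               cross(hull[hull.size() - 2], hull.back(), p) <= 0)
            hull.pop_back();
        hull.push_back(p);
    }
    return hull;
}
```

Since the sort was already done on the GPU, this loop is linear in the number of remaining points.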
1.2.1 Our Essential Ideas behind This Round of Preprocessing
Fig. 2 The pseudo-code of the proposed preprocessing approach
This section presents a novel preprocessing method for further discarding the points inside the convex hull. The essential idea behind this approach is to take advantage of the order of the sorted points when checking whether a point is located inside a triangle. Typically, checking whether a point is located inside a triangle requires judging whether the point is on the left side of each edge of the triangle.
For example, in Figure 1(b), there are 15 points in total, and 14 points have been sorted according to their angles. Taking the point P4 as an example, it would generally be necessary to decide whether P4 is on the left side of the directed lines P0P3, P3P8, and P8P0. However, because the points are sorted in ascending order of angle, the angle of P4 is greater than that of the point P3 and less than that of the point P8. Therefore, P4 is obviously on the left side of the directed line P0P3 and on the right side of the directed line P0P8. Hence, it only remains to check whether P4 is on the left side of the directed line P3P8 in order to determine whether P4 is located inside the triangle P0P3P8.
In short, it suffices to check whether the point P4 is located on the left side of the directed line P3P8 to determine whether it is an interior point.
Similarly, for the point P11, it is only necessary to verify whether it is located on the right side of the directed line P12P8. By exploiting the order of the sorted points, the computational cost can be reduced.
The pseudo-code for this method is listed in Figure 2.
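Since Figure 2 is not reproduced here, the fragment below is our own sketch of the core test described above: given the angular ordering, deciding whether a point such as P4 lies inside the triangle P0P3P8 reduces to a single left-of-line (cross product) test against the directed line P3P8. The function names are ours, not from the paper:

```cpp
struct Pt { double x, y; };

// Signed area test: > 0 means c is strictly left of the directed line a->b.
double side(const Pt& a, const Pt& b, const Pt& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// p is angularly between prev and next (guaranteed by the sort), so p is an
// interior point exactly when it lies strictly left of the line prev->next.
bool interiorBySortedOrder(const Pt& prev, const Pt& next, const Pt& p) {
    return side(prev, next, p) > 0.0;
}
```

For the example in the text, prev and next would be P3 and P8, and the two implicit edge tests against P0P3 and P8P0 are already settled by the angular order.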
1.2.2 Correctness of the Preprocessing Approach
It is clear that, when searching for the convex hull, if a point is located inside a triangular or quadrilateral convex polygon formed by other points, it must be an interior point and can be discarded directly. In Section 1.2.1, we explained that it is only necessary to judge whether a point is located on the left or right side of a single directed line to determine whether it is an interior point. The proposed preprocessing approach only removes those points that are unambiguously identified as interior ones; we never discard any potential extreme points. Hence, the correctness of the proposed preprocessing approach is guaranteed.
In the proposed preprocessing approach (see Figure 2), data dependencies exist in the discarding of interior points because the point Ptemp needs to be determined dynamically. For example, after checking the point P2, the point Ptemp is set to P2 for checking the point P3. When checking the point P4, the point Ptemp is exactly the point P3; however, after checking the point P4, the point Ptemp is still the point P3. The point Ptemp therefore cannot be determined independently when checking each point.
Due to these data dependencies, the proposed preprocessing approach is well suited to a sequential implementation. To implement it in a parallel fashion, an effective solution is to first divide the large series of pre-sorted points into several smaller subsets, and then check the interior points of all subsets in parallel. Within an individual subset, the checking of interior points is still performed sequentially. This solution follows the divide-and-conquer fashion.
The basic idea behind this round of discarding is straightforward. The four extreme points with min or max x or y coordinates can be easily located and then used to form a convex quadrilateral; any points located inside this quadrilateral are interior points and can be discarded directly. This round of discarding is easy to implement sequentially.
The checking of each point to judge whether it falls inside the convex quadrilateral does not involve any data dependencies. In other words, all points can be checked in parallel. We design a simple CUDA kernel to parallelize this checking: each thread is in charge of determining whether one point is located inside the convex quadrilateral. An array int pos[n] is allocated on the GPU to store the indicator values of all points. If a point is inside the quadrilateral, its indicator value is set to 0; otherwise, it is set to 1.
The calculation of the distance and angle for each point is straightforward. We also design a kernel to compute the distances and angles in parallel: each thread within the thread grid calculates the distance and angle of one point. The results are stored in two arrays, dist[n] and angle[n]. The angles are used to sort the points, and the distances are employed in the second round of discarding.
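For illustration, a host-side C++ sketch of this kernel is given below; the use of atan2 and hypot is our assumption, since any monotone angle measure would serve for sorting:

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Compute, for every point, its polar angle and distance relative to the
// pivot (the lowest point). On the GPU, each iteration of the loop is
// executed by one thread; angle[] and dist[] mirror the arrays in the text.
void computePolar(const std::vector<Pt>& pts, const Pt& pivot,
                  std::vector<double>& angle, std::vector<double>& dist) {
    angle.resize(pts.size());
    dist.resize(pts.size());
    for (size_t i = 0; i < pts.size(); ++i) {
        angle[i] = std::atan2(pts[i].y - pivot.y, pts[i].x - pivot.x);
        dist[i]  = std::hypot(pts[i].x - pivot.x, pts[i].y - pivot.y);
    }
}
```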
To speed up the sorting of the remaining points according to their angles, we use the efficient function provided by Thrust, i.e., thrust::sort_by_key(). The keys used for sorting are the angles of the points. Several useful zip_iterators are created to combine the coordinates, angle, and distance of all points into a virtual array of structures. This sorting based on 32-bit keys is extremely fast and thus improves the efficiency of the entire implementation.
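As a host-side illustration of this sort-by-key step (our sketch; on the GPU, the index indirection below is replaced by a Thrust zip_iterator over the point attributes):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Pt { double x, y; };

// Reorder pts and angle together so that angle is ascending: the angles are
// the sort keys, and the point records travel with them, mimicking
// thrust::sort_by_key(angle.begin(), angle.end(), zipped_points).
void sortByAngle(std::vector<Pt>& pts, std::vector<double>& angle) {
    std::vector<size_t> idx(pts.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return angle[a] < angle[b]; });
    std::vector<Pt> p2(pts.size());
    std::vector<double> a2(pts.size());
    for (size_t i = 0; i < idx.size(); ++i) {
        p2[i] = pts[idx[i]];
        a2[i] = angle[idx[i]];
    }
    pts.swap(p2);
    angle.swap(a2);
}
```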
We specifically design a CUDA kernel for each region to further filter the interior points using the proposed preprocessing approach. More details about this preprocessing approach are introduced in Section 1.2.
We have tested our implementation against the Qhull library[28] using three groups of test data on two different platforms (see Table 1). The Qhull library[28] is the fastest implementation of the state-of-the-art baseline algorithm, i.e., the QuickHull algorithm[7]. Three groups of datasets are generated for testing.
(1) The first group of test datasets is composed of 5 sets of points randomly located in a unit square, generated using the rbox component provided by Qhull.
(2) The second group of test datasets consists of 5 sets of randomly located points in a circle.
(3) The third group of test datasets includes 5 point sets derived from 3D mesh models by mapping the vertices of each model onto the x-y plane. The mesh models are obtained from the Stanford 3D Scanning Repository (http://www.graphics.stanford.edu/data/3Dscanrep/) and the GIT Large Geometry Models Archive (http://www.cc.gatech.edu/projects/large_models/).
3.1.1 Comparative Experimental Results for the Point Sets Locating in a Square
The running time on the GTX 660M for the first group of five datasets is listed in Table 2. To analyze how the computational cost is distributed between the GPU and CPU sides, we record the execution time of the two sides separately.
Experimental results show that the speedups of the proposed parallel algorithm over Qhull increase with the data volume, although the increase is not significant; the speedup is approximately 3x~4x on average. The workload percentage of the CPU side is much smaller than that of the GPU side, and it decreases steadily as the test data size increases; it is usually less than 10%.
Tab. 1 The platforms used for testing
3.1.2 Comparative Experimental Results for the Point Sets Locating in a Circle
For the second group of test data, the running time is listed in Table 3. The speedup is approximately 5x~6x on average and 6x~7x in the best cases.
Tab. 2 Comparison of running time for the point sets locating in squares on GTX 660M
3.1.3 Comparative Experimental Results for the Point Sets Derived from Mesh Models
As shown in Table 4, the speedup of gScan over Qhull is approximately 5x~6x on average and 6x~7x in the best cases, which is greater than that for the points located in squares but less than that for the points located in circles.
Tab. 3 Comparison of running time for the point sets locating in circles on GTX 660M
There are two rounds of discarding in the proposed parallel algorithm. The first round of discarding is based on the four extreme points with min or max x or y coordinates; we propose an angle-based preprocessing approach for the second round. To evaluate the effectiveness of the proposed preprocessing approach, we count the remaining points after each round of discarding, calculate the corresponding percentages of remaining points, and compare the effectiveness of the two rounds. The results show that our preprocessing approach dramatically reduces the number of remaining points and therefore improves the overall efficiency of gScan. In addition, the effectiveness of the second round of discarding improves as the data size increases. Comparing the results for the three groups of test data, the results for the sets of points in circles are the best.
Tab. 4 Comparison of running time for the point sets derived from models on GTX 660M (unit: ms)
According to the performance gains reported in existing related work[8,9,14], our implementation cannot achieve higher efficiency than theirs. In [8], it was reported that the speedup of their GPU implementation over the Qhull library is about 10x~15x for randomly distributed points. In [9], the parallel QuickHull implementation on the GPU achieved a speedup of 11x~12x for uniformly distributed input points. In [14], the GPU-accelerated QuickHull implementation achieved an order of magnitude speedup over the Qhull library for both randomly and uniformly distributed points.
We believe the probable main cause of our implementation's lower efficiency is that the underlying convex hull algorithm chosen for parallelization is the Graham scan rather than QuickHull. In the sequential setting, the QuickHull algorithm is much more efficient than the Graham scan, since QuickHull by nature can remove non-extreme points very fast. When parallelizing QuickHull on the GPU, this feature can also be exploited, and therefore relatively higher efficiency can be obtained.
Although our implementation cannot achieve such high efficiency, it has an obvious advantage. Compared with the GPU-accelerated convex hull algorithms designed on the basis of QuickHull[8,9,14], the main advantage of our algorithm is that it is very easy to implement, mainly due to (1) the use of the library Thrust and (2) relatively few data dependencies.
The first weakness of gScan is that the efficiency of the second round of preprocessing, which invokes only one thread block, may not be the highest.
Another obvious shortcoming of our algorithm is that it does not perform well when all the input points are extreme points. In this case, the two rounds of discarding remove nothing and are computationally wasteful, and the entire algorithm is inefficient. In such cases, the gift wrapping algorithm could be employed instead.
We expect to achieve a significant improvement in the efficiency of our implementation by replacing the primitives provided by Thrust with the corresponding primitives provided by CUB. Future work includes implementing our algorithm using CUB and evaluating its performance.
In this paper, we presented a parallel Graham scan algorithm for calculating 2D convex hulls on the GPU. The proposed parallel algorithm is mainly composed of two stages: the first stage includes two rounds of preprocessing executed on the GPU, and the second stage finalizes the calculation of the convex hull of the remaining points on the CPU. Our parallel algorithm is implemented by heavily exploiting the library Thrust: several efficient data-parallel primitives provided by Thrust, such as parallel sorting, reduction, and partitioning, make our implementation simple and convenient. We have compared our implementation with the library Qhull using three groups of test data on two different platforms. We observed that our algorithm gScan achieves speedups of 5x~6x on average and 6x~7x in the best cases over the library Qhull, and can calculate the convex hull of 20M scattered points in less than 0.5 seconds. Although our algorithm is not as fast as some related GPU-based QuickHull implementations, it is very simple and easy to implement. It could be an alternative in practice.