Small objects detection in UAV aerial images based on improved Faster R-CNN

2020-04-21 00:54

WANG Ji-wu, LUO Hai-bao, YU Peng-fei, LI Chen-yang

(School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing 100044, China)

Abstract: To solve the problem of small objects detection in unmanned aerial vehicle (UAV) aerial images with complex backgrounds, a general detection method for multi-scale small objects based on the Faster region-based convolutional neural network (Faster R-CNN) is proposed. The bird’s nest on the high-voltage tower is taken as the research object. Firstly, we use an improved convolutional neural network, ResNet101, to extract object features, and then use multi-scale sliding windows to obtain the object region proposals on convolution feature maps with different resolutions. Finally, a deconvolution operation is added to further enhance the selected higher-resolution feature map, which is then taken as the feature mapping layer of the region proposals passed to the object detection sub-network. The detection results for bird’s nests in UAV aerial images show that the proposed method can precisely detect small objects in aerial images.

Key words: Faster region-based convolutional neural network (Faster R-CNN); ResNet101; unmanned aerial vehicle (UAV); small objects detection; bird’s nest

0 Introduction

At present, there are three main object detection frameworks: Faster region-based convolutional neural network (Faster R-CNN)[1], single shot multibox detector (SSD)[2] and you only look once (YOLO)[3]. Compared with the other two frameworks, Faster R-CNN usually has higher detection precision. However, its detection precision varies with object scale: it performs well on objects of general scale but relatively poorly on small objects, which are easily missed. There are two main reasons for this problem. On the one hand, the extraction of region proposal positions is not precise enough. Faster R-CNN uses an anchor mechanism in the region proposal network (RPN) to generate nine region proposals of three ratios and three scales at each pixel position of the last convolutional feature map. Compared with the ground-truth box, the region proposals generated by RPN are too large for small objects, which results in imprecise extraction of the region proposals. On the other hand, the repeated max pooling operations cause the small-object information of the original image to be easily lost on the deep convolutional feature map. In Faster R-CNN, a max pooling operation is generally applied after every two or three adjacent convolutional layers, which effectively reduces the computational complexity of the network model and gives the convolutional neural network translation and rotation invariance to some extent. However, it also brings some problems. The resolution of the deep convolutional feature map is much lower than that of the original image, so part of the original image information is lost on the deep convolutional feature map. In addition, the region of interest (ROI) pooling in the Faster R-CNN detection sub-network uses the output feature map of the conv5_3 layer. For small objects that have undergone multi-layer pooling, the loss of feature information is severe, and sufficient information cannot be retained for subsequent classification and regression. Therefore, for the detection of small objects, deeper convolutional features are not necessarily more conducive to good detection results.

To address the existing problems of small objects detection with the Faster R-CNN method, this paper presents an improved method based on Faster R-CNN. Firstly, we use an improved convolutional neural network, ResNet101[4], to extract object features. Secondly, multi-scale sliding windows are used to obtain the object region proposals on deep convolution feature maps with different resolutions. Finally, a deconvolution operation is added to further enhance the selected higher-resolution feature map, which serves as the feature mapping layer of the region proposals passed to the object detection sub-network. This method provides a reliable basis for the automatic detection of small objects in UAV aerial images.

1 Design of feature extraction network

Effective extraction of object features is a key step in image object detection. The convolutional neural network[5-6] has strong image classification ability because it is good at mining local features of image data, and its local perception and weight sharing greatly reduce the number of parameters to be learned, which effectively improves network training performance. In this paper, the ResNet101 network architecture is improved to perform feature extraction for the image objects. Compared with AlexNet[7], VGGNet[8], GoogLeNet[9] and other feature extraction networks, the ResNet network has the following advantages:

1) ResNet normalizes the input data at each layer by adding a batch normalization layer, which effectively accelerates training convergence and reduces the degree of over-fitting of the network model.

2) ResNet avoids the gradient dispersion (vanishing gradient) problem caused by stacking weight layers by using a new “shortcut” identity mapping connection, so that the network performance stays in an optimal state and does not decrease as the network depth increases.

3) ResNet50/101/152 adopt the “bottleneck design”. As shown in Fig.1(a), by using 1×1 convolutions to control the number of input and output feature maps of the 3×3 convolution, the number of convolution parameters is greatly reduced while the depth and width of the network are increased.

In order to obtain effective and rich object features, we propose a ResNet variant network structure by referring to the inception network structure, as shown in Fig.1(b). Such a design enables each layer in the network to learn sparse or non-sparse features, which increases the adaptability of the network to scale. Meanwhile, by using two 3×3 convolutions, the network obtains a larger receptive field than before and also has fewer parameters.

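$$256\times1\times1\times64+64\times3\times3\times32+(64+32)\times3\times3\times16+64\times1\times1\times256=65\,024, \tag{1}$$

$$256\times1\times1\times64+64\times3\times3\times64+64\times1\times1\times256=69\,632. \tag{2}$$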

Comparing Eq.(1), the parameter count of the improved structure, with Eq.(2), that of the original bottleneck structure, shows that the improved network structure has fewer training parameters. Furthermore, the subsequent experimental results show that the proposed ResNet variant structure improves the detection precision significantly.
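The arithmetic in Eq.(1) and Eq.(2) can be checked with a short script. This is only a minimal sketch: the helper name `conv_params` and the two layer lists are illustrative, and biases and batch normalization parameters are ignored, as they are in the equations.

```python
def conv_params(c_in, k, c_out):
    """Weights of a k x k convolution with c_in input and c_out output channels (no bias)."""
    return c_in * k * k * c_out


# Improved block, term by term as written in Eq.(1).
improved = [
    conv_params(256, 1, 64),      # 1x1 convolution, 256 -> 64
    conv_params(64, 3, 32),       # first 3x3 convolution, 64 -> 32
    conv_params(64 + 32, 3, 16),  # second 3x3 convolution, (64+32) -> 16
    conv_params(64, 1, 256),      # 1x1 convolution, 64 -> 256
]

# Original bottleneck block, term by term as written in Eq.(2).
original = [
    conv_params(256, 1, 64),
    conv_params(64, 3, 64),
    conv_params(64, 1, 256),
]

improved_total = sum(improved)
original_total = sum(original)
saving = 1 - improved_total / original_total

print(improved_total, original_total, f"{saving:.1%}")
```

Under these counts the improved block uses 65 024 weights against 69 632 for the original bottleneck, a reduction of about 6.6% per block.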

Fig.1 Comparison of ResNet101 network structures

2 Design of object detection network

2.1 Overall structure of object detection network

In order to solve the problem of multi-scale small objects detection in aerial images, an improved method based on Faster R-CNN method is proposed. The overall structure is shown in Fig.2.

Compared with Faster R-CNN method, the improvements of the method are as follows:

1) To solve the problem of imprecise location of region proposals for small objects, we propose a multi-scale sliding window method to obtain object region proposals on deep convolution feature maps with different resolutions, which is called multi-scale RPN (MS-RPN). According to the actual distribution of the object’s own scale, the network sets reasonable sliding windows with different sizes on different deep feature maps to generate region proposals with abundant scales on the input image, so that MS-RPN can extract more precise region proposals than RPN.

2) To address the problem that the information of small objects in the original image disappears on deep convolution feature maps, we first use the improved ResNet101 network structure to extract object features; then we select the ResNet_4w convolution feature map, which has appropriate depth and high resolution, as the feature mapping layer of the region proposals, and add a deconvolution operation to further enhance the resolution of this feature layer; finally, the region proposals generated by MS-RPN are pooled into a fixed-size feature map by the ROI pooling operation and then fed into the ResNet_5c convolution layer to extract the object features once more before the final detection is performed.

Fig.2 Overall framework of small objects detection network

2.2 MS-RPN

The structure of MS-RPN is shown in Fig.3. In order to cover objects of various scales, especially small objects, we set reasonable sliding windows with different sizes on the deep convolutional layers ResNet_3d, ResNet_4f and ResNet_5c, respectively, and the region corresponding to each sliding window is mapped to the input image as a proposal window. The subsequent classification and regression processes are consistent with those of the classic RPN network. Because the resolution of the ResNet_3d feature layer is higher and its response to small objects is stronger than those of the other deep convolution feature layers in MS-RPN, it is mainly used to extract the region proposals for small objects in the input image. Considering the detection speed, we use 5×5 and 7×7 sliding windows on this feature layer and set the step size of the sliding window to 2. ResNet_4f is mainly used for normal-size objects; besides the 5×5 and 7×7 sliding windows, an additional 9×9 sliding window is added, and all sliding windows have a step size of 1. For ResNet_5c, sliding windows of 7×7, 9×9 and 11×11 are used, and the sliding step is also set to 1. The experimental results show that the proposed MS-RPN network keeps a high recall rate for small objects in UAV aerial images.
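The window configuration above can be illustrated with a small sketch that enumerates sliding windows on the three layers and maps each one back to the input image. Several things here are assumptions rather than details given in the paper: the cumulative strides of 8, 16 and 32 for ResNet_3d, ResNet_4f and ResNet_5c (the usual ResNet values), the padded "one window per visited cell" placement, the mapping of a k×k window to a box of side k × stride, and all names (`LevelConfig`, `generate_proposals`, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelConfig:
    name: str
    stride: int          # assumed cumulative stride of the layer w.r.t. the input
    window_sizes: tuple  # sliding-window sizes on the feature map (from the text)
    step: int            # sliding step on the feature map (from the text)


LEVELS = (
    LevelConfig("ResNet_3d", 8, (5, 7), 2),
    LevelConfig("ResNet_4f", 16, (5, 7, 9), 1),
    LevelConfig("ResNet_5c", 32, (7, 9, 11), 1),
)


def window_to_image_box(cx, cy, k, stride):
    """Map a k x k window centred on feature cell (cx, cy) to (x1, y1, x2, y2) in pixels."""
    half = k * stride / 2.0
    x_c = (cx + 0.5) * stride
    y_c = (cy + 0.5) * stride
    return (x_c - half, y_c - half, x_c + half, y_c + half)


def generate_proposals(img_w, img_h, levels=LEVELS):
    """Enumerate proposal windows for every level and window size (boxes are not clipped)."""
    boxes = []
    for lv in levels:
        feat_w = -(-img_w // lv.stride)  # ceiling division
        feat_h = -(-img_h // lv.stride)
        for k in lv.window_sizes:
            for cy in range(0, feat_h, lv.step):
                for cx in range(0, feat_w, lv.step):
                    boxes.append(window_to_image_box(cx, cy, k, lv.stride))
    return boxes


boxes = generate_proposals(1000, 600)
print(len(boxes))
```

With these assumed strides, a 1 000×600 input yields 13 794 windows, which is of the same order as the roughly 12 000 region proposals reported in Section 2.3; the exact figure depends on padding and border handling that the paper does not specify.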

Fig.3 MS-RPN structure

2.3 Network loss function and training details

In order to train the MS-RPN network, it is necessary to label the region proposals corresponding to each sliding window. We assign a positive label to two kinds of region proposals: (i) the region proposal with the highest intersection-over-union (IoU) overlap with a ground-truth box, or (ii) a region proposal that has an IoU overlap higher than 0.5 with any ground-truth box. We assign a negative label to a region proposal if its IoU is lower than 0.2 for all ground-truth boxes. Region proposals that are neither positive nor negative do not contribute to the training objective. The total loss function of the MS-RPN network follows the calculation method of the RPN loss function in Faster R-CNN. Because the region proposals in this paper come from different convolution layers, the calculation is slightly different from that of RPN. The specific calculation method is expressed as

(3)

where M is the number of convolution layers participating in region proposal generation, which is 3; w_m is the sample weight corresponding to each convolution layer; S_m is the sample set extracted for each convolution layer, whose size is 128; l_m is the loss function of any convolution layer in MS-RPN, which includes the classification loss function {p_i} and the regression loss function {t_i}. The whole object detection network is trained end-to-end using back propagation and stochastic gradient descent.
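The labeling rule used to prepare the MS-RPN training samples (positive for the best match of each ground-truth box or for IoU above 0.5, negative for IoU below 0.2 with every ground-truth box, ignored otherwise) can be sketched as follows. The function names and the label encoding (1, 0, -1) are illustrative assumptions; only the two thresholds come from the text.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def label_proposals(proposals, gt_boxes, pos_thr=0.5, neg_thr=0.2):
    """Return 1 (positive), 0 (negative) or -1 (ignored in training) for each proposal."""
    overlaps = [[iou(p, g) for g in gt_boxes] for p in proposals]
    labels = []
    for row in overlaps:
        best = max(row) if row else 0.0
        if best > pos_thr:          # rule (ii): IoU above 0.5 with any ground truth
            labels.append(1)
        elif best < neg_thr:        # negative: IoU below 0.2 for all ground truths
            labels.append(0)
        else:
            labels.append(-1)       # neither positive nor negative
    # Rule (i): the best-overlapping proposal of each ground-truth box is positive.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in overlaps]
        if column and max(column) > 0:
            labels[column.index(max(column))] = 1
    return labels


proposals = [(0, 0, 10, 10), (0, 0, 10, 3), (20, 20, 30, 30), (100, 100, 110, 104)]
gt_boxes = [(0, 0, 10, 10), (100, 100, 110, 110)]
labels = label_proposals(proposals, gt_boxes)
print(labels)
```

In this toy example the last proposal overlaps its ground-truth box with an IoU of only 0.4, but it is still labeled positive because it is the best match for that box, which is the purpose of rule (i).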

In the network training stage, the large number and uneven distribution of negative samples have a great impact on the detection precision of the final network model. Therefore, we rank all negative samples by the IoU value between the region proposal and the ground-truth box, and then select some samples with higher IoU values as the negative samples added to the training set.
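The negative-sample selection described above amounts to ranking negatives by their IoU with the ground truth and keeping the top ones. The sketch below assumes each negative is stored as a `(box, iou)` pair; the parameter `num_keep` is hypothetical, since the paper does not state how many negatives are retained.

```python
def select_hard_negatives(negatives, num_keep):
    """Keep the num_keep negatives with the highest IoU to a ground-truth box.

    negatives: list of (box, iou_with_ground_truth) pairs, all below the negative threshold.
    """
    ranked = sorted(negatives, key=lambda item: item[1], reverse=True)
    return ranked[:num_keep]


negatives = [
    ((0, 0, 8, 8), 0.02),
    ((4, 4, 20, 20), 0.18),
    ((50, 50, 70, 70), 0.00),
    ((6, 2, 18, 14), 0.11),
]
hard = select_hard_negatives(negatives, num_keep=2)
print([score for _, score in hard])
```

Negatives with a higher IoU lie closer to real objects and are harder to distinguish from them, which is presumably why they are preferred over the many trivially easy background windows.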

For an input image with a resolution of 1 000×600, about 12 000 region proposals are obtained by the MS-RPN method. However, there is a lot of overlap among the region proposals, which seriously affects the detection speed. Therefore, based on the confidence values of the region proposals, we use non-maximum suppression (NMS) to select some high-quality region proposals. The IoU threshold is set to 0.7. After the NMS operation, only about 1 000 region proposals are left for each image. Subsequently, the 100 regions with the highest confidence are selected from the remaining 1 000 region proposals as the final region proposals, which are then processed by the ROI pooling operation and sent to the subsequent convolution feature extraction layer and the object detection sub-network.
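The filtering step above (NMS with an IoU threshold of 0.7, followed by keeping the 100 most confident proposals) can be sketched with a plain greedy NMS. This is a generic textbook formulation written for clarity, not the authors' implementation, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms_top_k(boxes, scores, iou_thr=0.7, top_k=100):
    """Greedy NMS: return indices of kept boxes, highest score first, at most top_k."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep[:top_k]


boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms_top_k(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of 0.9 and is suppressed, while the third box does not overlap either and is kept. The quadratic loop is adequate for a sketch; production code would normally use a vectorized library routine.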

Meanwhile, in order to reduce the training parameters and accelerate the detection speed of the network, global average pooling is used in place of fully connected layers in the detection part to perform the classification and the bounding box regression of the object.
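A small sketch may help show why global average pooling reduces the number of parameters. The channel count (2 048), the 7×7 pooled size and the two output classes used below are assumptions chosen for illustration; the paper does not give these values, and the helper names are hypothetical.

```python
def global_average_pool(feature_map):
    """Average each channel of a C x H x W nested list down to a single value."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fc_head_weights(channels, height, width, n_out):
    """Weights of a fully connected layer applied to the flattened C x H x W map."""
    return channels * height * width * n_out


def gap_head_weights(channels, n_out):
    """Weights of a linear layer applied after global average pooling."""
    return channels * n_out


feature_map = [
    [[1, 2], [3, 4]],  # channel 0
    [[0, 0], [0, 8]],  # channel 1
]
print(global_average_pool(feature_map))

# Assumed sizes: 2048 channels, 7x7 pooled map, 2 classes (bird's nest / background).
print(fc_head_weights(2048, 7, 7, 2), gap_head_weights(2048, 2))
```

Because pooling collapses the spatial dimensions before the final linear layer, the weight count of the head shrinks by a factor equal to the pooled map area (49 in this assumed setting).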

3 Experimental results and analysis

3.1 Building object data set

In our work, the research object is the bird’s nest on the high-voltage tower in UAV aerial images, which is used to verify the proposed method.

In order to enrich the image training data set, we use image augmentation techniques (image flipping, image rotation, increasing image contrast, adding Gaussian noise, etc.) to expand the image data set. Besides, the sample database contains not only images with common complex backgrounds, but also images with severe interference from illumination, occlusion and haze. Then, all image samples are scaled uniformly to 1 000×600, and the position and label of the bird’s nest in each image are annotated to conform to the standard data set format of Pascal VOC. Finally, the images in the sample database are divided into two groups at a ratio of 3 to 1: the training set contains 9 000 samples and the test set contains 3 000 samples.

3.2 Experimental results and analysis on object test set

The object recall rates under different IoU thresholds are used as evaluation criteria, following Ref.[10]. The MS-RPN region proposal method is compared with the RPN region proposal method in Faster R-CNN on the constructed bird’s nest data set. From Fig.4, we can see that both RPN and MS-RPN have a high recall rate when the threshold is set between 0.5 and 0.7. When the threshold exceeds 0.7, however, MS-RPN still maintains a good recall rate, while the recall rate of RPN decreases sharply. The results show that the MS-RPN method is more precise than the RPN method. There are two main reasons. Firstly, the region proposals selected by the RPN method are not precise enough for small objects. Secondly, the RPN method generally extracts the region proposals from the last deep convolution feature map; because the resolution of this feature layer is low, its detection ability for small objects is limited. In our work, by contrast, we set reasonable sliding windows with different sizes on different deep convolution feature maps according to the scale distribution of the research objects, so that the region proposals are extracted with higher precision.

Fig.4 Comparison of recall rates under different IoU thresholds
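The evaluation criterion behind Fig.4, recall as a function of the IoU threshold, can be sketched as below. A ground-truth box counts as recalled when at least one proposal overlaps it with an IoU at or above the threshold; this definition and the function names are assumptions in the spirit of common proposal evaluation, since the exact protocol of Ref.[10] is not reproduced here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def recall_at_iou(proposals, gt_boxes, thr):
    """Fraction of ground-truth boxes matched by some proposal with IoU >= thr."""
    if not gt_boxes:
        return 0.0
    hits = sum(1 for g in gt_boxes if any(iou(p, g) >= thr for p in proposals))
    return hits / len(gt_boxes)


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(0, 0, 10, 8), (20, 20, 30, 26)]  # IoU 0.8 and 0.6 with their targets
curve = {thr: recall_at_iou(proposals, gt_boxes, thr) for thr in (0.5, 0.7, 0.9)}
print(curve)
```

Loosely localized proposals still count at a threshold of 0.5 but drop out as the threshold rises, which is the effect the comparison between RPN and MS-RPN measures.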

Table 1 compares the performance of the proposed method with that of the best traditional object detection method, the deformable parts model (DPM), and with those of Faster R-CNN VGG16 and Faster R-CNN ResNet101 on the same bird’s nest data set.

Table 1 Comparison of bird’s nest detection results on test set

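| Detection method | Test data set (images) | mAP (%) | Miss rate (%) | Speed (frame/s) |
| --- | --- | --- | --- | --- |
| DPM | 3 000 | 42.34 | 39.77 | 15 |
| Faster R-CNN VGG16 | 3 000 | 71.40 | 18.73 | 9 |
| Faster R-CNN ResNet101 | 3 000 | 73.35 | 13.57 | 7 |
| Proposed method | 3 000 | 85.55 | 6.65 | 10 |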

Fig.5 Example detection results of bird’s nest on test set

The tests were carried out on an NVIDIA Titan X GPU. Compared with the DPM object detection method, the proposed method nearly doubles the detection precision (mAP of 85.55% versus 42.34%), but its detection speed is lower (10 frame/s versus 15 frame/s). Compared with Faster R-CNN VGG16 and Faster R-CNN ResNet101, the detection mAP is improved by 14.15 and 12.20 percentage points, respectively. Besides, the miss rate is reduced by about two-thirds and one half, respectively, which verifies the significant advantage of this method in small objects detection. Meanwhile, the detection speed of this method is slightly faster than those of the two Faster R-CNN methods mentioned above. Fig.5 shows some bird’s nest detection results of this method on the test set.
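The comparisons drawn from Table 1 can be reproduced with a few lines of arithmetic. The sketch below takes the figures directly from Table 1; the dictionary layout and the function name `compare` are illustrative. Note that the mAP gains are absolute differences in percentage points, while the miss-rate reductions are relative.

```python
# (mAP %, miss rate %, speed in frame/s), copied from Table 1.
results = {
    "DPM": (42.34, 39.77, 15),
    "Faster R-CNN VGG16": (71.40, 18.73, 9),
    "Faster R-CNN ResNet101": (73.35, 13.57, 7),
    "Proposed method": (85.55, 6.65, 10),
}


def compare(results, ours="Proposed method"):
    """Return {baseline: (mAP gain in points, relative miss-rate reduction)}."""
    our_map, our_miss, _ = results[ours]
    report = {}
    for name, (m_ap, miss, _fps) in results.items():
        if name == ours:
            continue
        gain_points = our_map - m_ap
        relative_miss_cut = (miss - our_miss) / miss
        report[name] = (gain_points, relative_miss_cut)
    return report


report = compare(results)
for name, (gain, cut) in report.items():
    print(f"{name}: mAP +{gain:.2f} points, miss rate -{cut:.1%} (relative)")
```

The relative miss-rate reductions come out at roughly 64% against Faster R-CNN VGG16 and roughly 51% against Faster R-CNN ResNet101, consistent with the "about two-thirds and one half" stated in the text.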

In order to further verify the detection performance of the proposed method, network model decomposition (ablation) experiments were carried out on the object data set, and the effects of the network design choices proposed in this paper on the detection results were analyzed. The experimental results in Table 2 show that the mAP of bird’s nest detection decreases by 6.4 percentage points if MS-RPN is not used to extract the region proposals, that the improved ResNet101 network structure improves the mAP by 3.6 percentage points, and that the deconvolution operation improves the mAP by 2.2 percentage points.

Table 2 Comparison of experimental results of network model decomposition

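| Project | Detection results | | | |
| --- | --- | --- | --- | --- |
| Improved ResNet101 | ✕ | √ | √ | √ |
| MS-RPN | √ | ✕ | √ | √ |
| Dec-Conv | √ | √ | ✕ | √ |
| mAP (%) | 81.95 | 79.15 | 83.35 | 85.55 |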

3.3 Experimental results and analysis on VOC0712 object test set

In order to verify the generality of the proposed method, it is also tested on the VOC0712 data set, and the test results are compared with those of the DPM method, Faster R-CNN VGG16 and Faster R-CNN ResNet101. The four methods were trained and tested on the same training set and test set. The training set is composed of VOC2007-train and VOC2012-train, and the test set is VOC2007-test. Table 3 shows the test results of the four methods on some of the detected object classes. It can be seen from Table 3 that the three detection methods based on convolutional neural networks have better detection precision than the traditional DPM method on objects of all scales, because the convolutional neural network can automatically learn to extract object features. The proposed method performs basically the same as Faster R-CNN VGG16 and Faster R-CNN ResNet101 in detecting large objects such as airplanes and cars. The main reason is that most of the VOC2007 data set is composed of large objects. However, the proposed method has an obvious advantage on small objects such as birds, bottles and plants, and the overall detection precision is improved by nearly 10%. Meanwhile, the detection precision of Faster R-CNN ResNet101 is slightly higher than that of Faster R-CNN VGG16 on objects of various scales, because better object features can be obtained as the network depth increases. To sum up, the results show that the proposed MS-RPN network obtains higher-quality region proposals. Because the objects in UAV aerial images are generally much smaller than the whole image, the proposed method is reasonable and feasible.

Table 3 Comparison of detection results on VOC2007 test set

4 Conclusion

This paper presents a general multi-scale small objects detection method based on Faster R-CNN for UAV aerial images. The experimental results show that the proposed method can precisely detect small objects in aerial images, that its detection precision is higher than those of the best traditional method and the other Faster R-CNN methods, and that its speed is also slightly faster than those of the other Faster R-CNN methods.