SUN Zhao, MA Chao,*, WANG Liang, MENG Ran, and PEI Shanshan
1. Faculty of Information Technology, Macau University of Science and Technology, Macau 999078, China; 2. Baidu’s Intelligent Driving Group, Beijing 100193, China; 3. Beijing Smarter Eye Technology Co., Ltd., Beijing 100023, China
Abstract: An obstacle perception system for intelligent vehicles is proposed. The system combines the stereo vision technique with a deep learning network model and is applied to obstacle perception tasks in complex environments. We provide a complete system design, which includes the hardware parameters, software framework, algorithm principle, and optimization method. In addition, dedicated experiments are designed to demonstrate that the performance of the proposed system meets the requirements of practical application. The experimental results show that the proposed system is effective for both standard and non-standard obstacles and is suitable for different weather and lighting conditions in complex environments. This indicates that the proposed system is flexible and robust for intelligent vehicles.
Keywords: intelligent vehicle, stereo matching, deep learning, environment perception.
Obstacle perception is the primary task of intelligent vehicles and focuses on the 3D environment around the host vehicle. Three major types of sensor systems can be found in the recent literature: LIDAR-based systems [1-3], radar-based systems [4,5] and vision-based systems [6-8]. Compared with LIDAR and radar, a camera does not suffer from motion artifacts and can generate dense disparity information, which overcomes the shortcomings of LIDAR and radar. Stereo vision is widely used in intelligent vehicles and advanced driver assistance systems. At present, binocular stereo vision is becoming the trend in this field [9].
In this paper, we propose a stereo vision perception system on a low-power mobile platform, which consists of a binocular camera and an integrated computing unit with a graphics processing unit (GPU) and central processing units (CPUs). A multi-scale fast stereo matching algorithm is proposed to calculate disparity information, and a compressed deep learning network is adopted to improve obstacle perception. Besides, a special fusion strategy makes the proposed system more robust in complex environments.
We propose a multi-scale fast stereo matching algorithm, improved from the multi-path-Viterbi (MPV) algorithm [10,11], to generate dense disparity information. Alternative regions of obstacles are proposed using the road model and the v-disparity space. A novel deep learning network, simplified with a dedicated compression method, is used for obstacle perception.
The proposed multi-scale fast stereo matching algorithm includes three parts: construction of the multi-scale image pyramid space, matching cost calculation and dynamic optimization.
Based on the improved stereo matching algorithm, we propose an unsupervised method to detect the road and obstacle through adding an extra Viterbi process on v-disparity space. In the obstacle perception task, both the category and the location are necessary. We carry out the localization of objects based on disparity space.
In recent years, deep learning networks have achieved great success in target recognition and detection. We adopt a novel convolutional neural network (CNN) to identify and classify alternative regions. With the network compression technique [12,13] and the transfer learning method, the computing cost of the CNN is reduced.
In this paper, we adopt a heterogeneous mobile platform that integrates CPUs and a GPU [14] and is programmed in a hybrid manner [15]. In the proposed system, the GPU runs the stereo matching algorithm and the deep learning network module, while the CPUs detect the road and the alternative regions.
The proposed system is built on the compute unified device architecture (CUDA), a parallel programming model and software environment launched by NVIDIA [16]. CUDA takes full advantage of the GPU to implement large-scale parallel computing and provides high-performance instructions [17].
We present a robust binocular perception system which adopts the stereo vision algorithm and the deep learning network. The contributions are as follows.
(i) On the basis of the low-power mobile platform, we propose a complete hardware design and software solution for the perception system.
(ii) A multi-scale fast stereo matching algorithm is proposed for disparity information which is used for the road detection and region proposal.
(iii) A deep learning network is adopted as a feature detector and the network is simplified by a compression method.
(iv) A special fusion strategy combines the stereo vision technique and the deep learning network to improve the perception result in complex environments.
(v) The distributed computing technique improves the computing performance of the heterogeneous system that integrates CPUs and a GPU.
The rest of this paper is organized as follows. In Section 2, we present the system design. In Section 3, we explain the principle of the proposed multi-scale fast stereo matching algorithm and the feature detector based on the compressed deep learning network model. We also introduce a special strategy to combine stereo vision and deep learning. Experiments and analysis in Section 4 support the validity of the proposed system. Final conclusions are drawn in Section 5.
In this section, we introduce the hardware design and the module design. The proposed system is mainly composed of three functional modules: stereo matching, alternative region detection and obstacle perception, as shown in Fig. 1.
Fig. 1 System working flow
The proposed system consists of the camera module (binocular cameras) and the computing unit (including the main board and the power system), as shown in Fig. 2.
Fig. 2 System hardware structure
The baseline of the binocular camera is 120 mm, the focal length of the lens is 6 mm and the resolution of the sensor is 640 pixel × 320 pixel. The depth of field (DOF) is from 3 m to 60 m and the horizontal field of view (FOV) is 40°, as shown in Table 1.
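As a rough consistency check on these parameters, the sketch below estimates the disparity range implied by the stated working distances. It assumes an ideal pinhole model in which the 40° horizontal FOV spans the full 640-pixel width, and it uses the standard relation d = f·B/Z; the function names are illustrative and are not part of the proposed system.

```python
import math

def focal_length_px(width_px, hfov_deg):
    """Focal length in pixels for an ideal pinhole camera (assumed model)."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def disparity_px(depth_m, baseline_m, focal_px):
    """Disparity in pixels from the standard stereo relation d = f * B / Z."""
    return focal_px * baseline_m / depth_m

f_px = focal_length_px(640, 40.0)          # sensor width and horizontal FOV
d_far = disparity_px(60.0, 0.120, f_px)    # farthest stated distance, 120 mm baseline
d_near = disparity_px(3.0, 0.120, f_px)    # nearest stated distance
print(f"f = {f_px:.0f} px, disparity from {d_far:.1f} px to {d_near:.1f} px")
```

Under these assumptions the disparity ranges from under 2 px at 60 m to about 35 px at 3 m, which gives a feel for the search scope that the stereo matching stage has to cover.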
The computing unit comprises an NVIDIA Kepler “GK20a” GPU with 192 SM3.2 CUDA cores and ARM Cortex-A15 CPUs with a Cortex-A15 battery-saving shadow core. In order to obtain the best observation, the system is installed at the top center of the vehicle’s windshield, as shown in Fig. 3. The power supply is provided by the vehicle, and the system is connected with the vehicle through the on-board diagnostics (OBD) interface.
Table 1 System parameters
Fig. 3 Assembly position
Stereo matching and deep-learning-based obstacle perception run on the GPU, while alternative region detection runs on the CPUs. In addition, the main CPU handles instruction dispatching, as shown in Fig. 4, where ISP denotes image signal processing and eMMC denotes the embedded multi-media card.
Fig. 4 Modules design
Stereo matching evaluates the disparity between spatially corresponding points in the left and right images. We build on the MPV algorithm [6,11] and suggest an improved method adapted to the low-power mobile platform. The resulting algorithm has the following characteristics.
(i) A bi-directional Viterbi algorithm is adopted to decode the stereo matching cost, and a special strategy is proposed to merge the paths on hierarchical space to further decrease the decoding error.
(ii) Structural similarity (SSIM) is used to calculate the disparity value of left and right images at epipolar lines.
(iii) We employ a fast calculation technique to find the best Viterbi path, and we employ a multi-scale method [18,19] to improve the calculation speed with only a small loss of accuracy.
3.1.1 MPV algorithm
The MPV algorithm includes two parts: the disparity estimation by a Viterbi process and the path-merging strategy over four bi-directional (horizontal, vertical, and two diagonal) Viterbi paths on the matching space.
For the first part, we adopt the SSIM measure as the cost function, which is defined as
where α, β and γ are parameters that define the proportion of the corresponding parts, u is the disparity value and p is the pixel coordinate in the left image. The three components of (1) represent the luminance, contrast and structure similarity measures, respectively, which are defined as
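For orientation, the standard SSIM decomposition that matches this description is sketched below. The patch notation is an assumption made here: x is a window around p in the left image and y is the window displaced by the disparity u in the right image, with means μ, standard deviations σ, covariance σ_xy and small stabilizing constants C_1, C_2 and C_3. How the similarity is turned into a matching cost is not specified by this sketch.

```latex
\mathrm{SSIM}(p,u) = \bigl[l(p,u)\bigr]^{\alpha}\,\bigl[c(p,u)\bigr]^{\beta}\,\bigl[s(p,u)\bigr]^{\gamma}

l(p,u) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad
c(p,u) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad
s(p,u) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}
```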
For the second part, the total variation (TV) constraint is used to constrain the disparity along all Viterbi paths. It is defined through the energy E(U) on the disparity map U as
where ε is the modified TV constraint that penalizes all disparity changes between p and p′, where p′ has disparity u′ and belongs to the neighborhood Lp of p, as
where G is the gradient, and λ is the tradeoff parameter to balance the TV term.
Based on this, the stereo matching solution is transformed into finding the disparity map U to minimize the energy function E(U). The Viterbi algorithm is adopted to approximate the optimum solution [21].
We use four bi-directional Viterbi paths on the matching space. Compared to other directions, horizontal directions have stronger constraints. Therefore, the optimum paths of other directions are calculated based on the results of horizontal directions. The node’s energy is updated by the bi-directional Viterbi algorithm in each layer. The MPV algorithm flow is shown as follows.
Algorithm 1 MPV algorithm
Input: Previous pixel
Output: New Viterbi energy
Step 1 Compute the horizontal direction Viterbi algorithm (left and right).
Step 2 Compute the vertical direction Viterbi algorithm (up and down).
Step 3 Compute the 2nd diagonals direction Viterbi algorithm (right down and left up).
Step 4 Compute the 3rd diagonals direction Viterbi algorithm (left down and right up).
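To make the Viterbi step concrete, the following minimal sketch runs a single forward pass along one scanline. It is not the full bi-directional, four-path MPV algorithm: it assumes a precomputed matching cost array and replaces the modified TV term ε with the plain penalty λ|u − u′|, and the function name `viterbi_scanline` is hypothetical.

```python
import numpy as np

def viterbi_scanline(cost, lam=1.0):
    """Minimise sum_p cost[p, u_p] + lam * sum_p |u_p - u_{p-1}| along one line.

    cost: array of shape (n_pixels, n_disparities) with matching costs.
    Returns the disparity index chosen for each pixel.
    """
    n, d = cost.shape
    disp = np.arange(d)
    # penalty[u_prev, u] is the simplified TV term between neighbouring pixels
    penalty = lam * np.abs(disp[:, None] - disp[None, :])
    energy = cost[0].astype(float).copy()
    back = np.zeros((n, d), dtype=int)
    for p in range(1, n):
        total = energy[:, None] + penalty        # total[u_prev, u]
        back[p] = np.argmin(total, axis=0)       # best predecessor for each u
        energy = total[back[p], disp] + cost[p]  # accumulated energy at pixel p
    # Backtrack from the lowest final energy to recover the path
    path = np.empty(n, dtype=int)
    path[-1] = int(np.argmin(energy))
    for p in range(n - 1, 0, -1):
        path[p - 1] = back[p, path[p]]
    return path

# Five pixels, four disparity candidates; pixel 2 has a misleading cost minimum.
cost = np.ones((5, 4))
cost[:, 2] = 0.0
cost[2] = [0.0, 1.0, 0.5, 1.0]
smooth = viterbi_scanline(cost, lam=1.0)
greedy = viterbi_scanline(cost, lam=0.0)
```

With the smoothness term switched on, the path stays on disparity 2 through the noisy pixel, whereas with λ = 0 it jumps to the misleading minimum. This illustrates why the TV constraint is applied along the Viterbi paths.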
3.1.2 Multi-scale fast matching
We propose a multi-scale fast disparity estimation method. Inspired by [22,23], we make two main proposals: a scale pyramid and multi-scale disparity propagation.
The top layer of the scale pyramid is the original image, and each lower layer is down-sampled from the layer above it, as shown in Fig. 5. We calculate the matching value over the full range at the last (lowest) layer, and the stereo matching result is used to initialize the layer above. In this paper, we suggest a three-layer scale pyramid, and the specific principle is as follows.
Fig. 5 Scale pyramid
Firstly, the three-layer scale pyramid is constructed with two down-sampling steps. That is, the top image (the 1st layer) is the original size image, the middle image (the 2nd layer) is a quarter of the original image size, and the bottom image (the 3rd layer) is one sixteenth of the original image size.
Next, the 3rd layer image is split into blocks of the same pixel size, and the MPV algorithm is run on these blocks. Starting from a random initialization, the stereo matching method calculates the disparity value of each block, and each pixel is assigned the disparity of the block that contains it. Because large-scale features exist in the 3rd layer, the block size should not be too small.
Then, the disparity of the 3rd layer is propagated to the 2nd layer as an initialization. However, the disparity map of the 2nd layer is four times as large as that of the 3rd layer. The corresponding positions in the 2nd layer are initialized from the 3rd layer, and the remaining pixels are initialized from neighboring pixels. We split the 2nd layer image into smaller blocks than those of the 3rd layer; in general, the closer a layer is to the original image, the smaller the block size. Again, the stereo matching algorithm gives a new disparity value to the 2nd layer.
Besides, the matching scope is not the same in each layer. Assuming the full search scope is d at the original image layer, it becomes d/4 at the 3rd layer. Furthermore, with the up-sampling of the disparity map, the scope of the upper layer is half of that of the lower layer.
Finally, the disparity is propagated to the top layer. Pixels are split into blocks and initialized as before. The stereo matching algorithm then calculates the final disparity information at the 1st layer.
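The pyramid construction and the coarse-to-fine propagation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: down-sampling is done by 2×2 averaging, propagation uses nearest-neighbour repetition, and disparities are doubled when moving up one layer because the image width doubles. All function names are hypothetical.

```python
import numpy as np

def downsample2(img):
    """Halve width and height by averaging 2x2 blocks (assumed sampling rule)."""
    h, w = img.shape
    img = img[: h - h % 2, : w - w % 2]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(img, levels=3):
    """Return [1st layer (original), 2nd layer, 3rd layer]."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(downsample2(pyr[-1]))
    return pyr

def propagate_disparity(disp_coarse):
    """Initialise the next finer layer from a coarser disparity map.

    Each coarse pixel is copied to its 2x2 neighbourhood, and the value is
    doubled because a shift of 1 px at the coarse layer is 2 px one layer up.
    """
    return 2 * np.repeat(np.repeat(disp_coarse, 2, axis=0), 2, axis=1)

pyramid = build_pyramid(np.zeros((320, 640)))
shapes = [layer.shape for layer in pyramid]   # 1st, 2nd and 3rd layer sizes
init_2nd = propagate_disparity(np.array([[1, 2]]))
```

The stereo matcher would then refine `init_2nd` at the 2nd layer within a reduced search scope, and the same propagation step would be repeated once more to reach the 1st layer.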
We transform the disparity map U into the v-disparity space and the u-disparity space. Hx{U}(i,j) denotes the number of points with disparity i in the jth row of the disparity map. In the same way, Hy{U}(i,j) denotes the corresponding count for the jth column, as
where δ denotes the Kronecker delta. Generally, we can detect the road model [6,11] in Hx{U} and segment the alternative regions of obstacles [24] using both Hx{U} and Hy{U} in the disparity map U. The extracted alternative regions are shown in Fig. 6. They prepare the input for target recognition by the deep learning model.
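The two histograms can be computed directly from an integer disparity map, as in the minimal sketch below. The array layout (rows by disparity for the v-disparity map, disparity by columns for the u-disparity map) and the function names are choices made here for illustration.

```python
import numpy as np

def v_disparity(disp, max_disp):
    """hist[j, i] = number of pixels in image row j whose disparity equals i."""
    h, _ = disp.shape
    hist = np.zeros((h, max_disp + 1), dtype=int)
    for j in range(h):
        row = disp[j]
        valid = row[(row >= 0) & (row <= max_disp)]  # drop invalid disparities
        hist[j] = np.bincount(valid, minlength=max_disp + 1)
    return hist

def u_disparity(disp, max_disp):
    """hist[i, c] = number of pixels in image column c whose disparity equals i."""
    return v_disparity(disp.T, max_disp).T

disp = np.array([[0, 1, 1],
                 [2, 2, 2]])
v_hist = v_disparity(disp, max_disp=2)
u_hist = u_disparity(disp, max_disp=2)
```

In the v-disparity map a planar road appears as a slanted line, while an upright obstacle concentrates its pixels at a single disparity, which is what makes the road fitting and region segmentation possible.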
Fig. 6 Alternative region
Binocular vision is an unsupervised environmental perception method. Through a reasonable disparity space detection method, it can quickly and accurately locate and detect obstacles. This is an important advantage for the proposed system. In the complex environment, the proposed system is sensitive to both standard obstacles and non-standard obstacles.
For target detection, the most popular end-to-end networks are CNNs. Many excellent methods have been developed, such as the single shot detector (SSD) [25], you only look once (YOLO) [26] and faster region CNN (R-CNN) [27].
We have already proposed the alternative obstacle regions based on the disparity map, so a visual geometry group (VGG)-Net [28] model is adopted as a feature detector. The VGG-Net is flexible in target recognition due to its strong generalization and transfer learning performance.
Considering the computing performance, we start from the VGG-16 network and apply a compression method [12,13] to simplify the model in the forward propagation stage. Our purpose is to improve the operation speed and reduce the memory cost without significant performance degradation. An off-the-shelf pre-trained VGG-Net parameter model is employed for transfer learning and supervised fine-tuning [29].
3.3.1 Feature detector based on deep learning model
The proposed system employs a compressed CNN model as a feature detector, as shown in Fig. 7, where Conv3-64 means a convolutional layer with 64 filters and a window size of 3×3, maxpooling means a pooling layer with a 2×2 filter, and FC means a fully-connected layer.
Fig. 7 Detector’s configuration based on VGG-Net at the second block diagram
The original VGG-16 model fixes the input size at 224×224×3 and uses two or three nonlinear rectification layers with small receptive fields. Following this model, the proposed feature detector consists of 13 convolutional layers, 2 fully-connected layers and 5 pooling layers, so that we can initialize its parameters with the pre-trained VGG-16 model. For faster learning [30], we use rectified linear units (ReLU) as the non-linearity, f(z)=max(0,z).
For standard obstacles, we focus on vehicles, pedestrians and bikes (including bicycles and motorcycles). The number of nodes at the last fully-connected layer depends on categories. Furthermore, we consider a dropout regularization [31] with a probability of 0.5 following the first fully-connected layer.
The output of the feature detector is a 1×3 vector. We normalize this vector so that its elements represent probabilities [32]. In training, we transform the probability vector into a one-hot vector to optimize the weights of the proposed model. During feature detection, however, we keep the probability vector, which is passed to a special strategy for classification.
3.3.2 Classification strategy
The 1×3 probability vector represents three categories: vehicles, pedestrians and bikes. We only consider the two largest probabilities. The ratio of the largest to the second-largest probability is compared with an a priori threshold. If the ratio is greater than the threshold, the category with the largest probability is taken as the obstacle category. Otherwise, the obstacle is not assigned to any of the three categories. Therefore, the proposed system outputs four kinds of obstacles in the final result: vehicles, pedestrians, bikes and others.
The advantage of this strategy is that the proposed system can not only recognize standard obstacles but is also sensitive to non-standard obstacles. It is therefore effective in complex environments.
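The ratio test in this subsection can be sketched in a few lines. The threshold value 2.0, the softmax normalization and the names `CLASSES`, `softmax` and `classify` are assumptions for illustration; the source does not state the threshold it uses.

```python
import numpy as np

CLASSES = ("vehicle", "pedestrian", "bike")

def softmax(scores):
    """Normalise raw scores into probabilities (assumed normalisation)."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(prob, ratio_threshold=2.0):
    """Return a class name, or 'others' when the top two classes are too close."""
    prob = np.asarray(prob, dtype=float)
    order = np.argsort(prob)[::-1]               # indices sorted by probability
    best, second = prob[order[0]], prob[order[1]]
    if second == 0 or best / second > ratio_threshold:
        return CLASSES[order[0]]
    return "others"

confident = classify([0.80, 0.15, 0.05])   # ratio 5.3 exceeds the threshold
ambiguous = classify([0.40, 0.35, 0.25])   # ratio 1.1 falls below the threshold
```

A region that the disparity map flags as an obstacle but that resembles none of the three classes tends to produce a flat probability vector, so it falls into the fourth category rather than being discarded.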
3.3.3 Training and compression
When training the proposed deep learning model, we proceed in two stages: transfer learning [33] and fine-tuning [34]. First, we apply the pre-trained weights of the convolutional layers and train the modified fully-connected layers on our private data sets. Then, the last two convolutional layers are trained for fine-tuning. We adopt the stochastic gradient descent (SGD) method with a learning rate of 10^-4. The cross-entropy loss function is used in both stages.
Furthermore, 20 images are selected at random in each of the 150 training iterations. The training data consist of three categories: vehicles, pedestrians and bikes, with 1 000 images per category. The training images are resized to the same size and manually labeled by category, as shown in Fig. 8.
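For reference, the cross-entropy loss and the SGD update take the standard form below. The notation is assumed here: B is the random batch of 20 images, y is the one-hot label, p̂ is the predicted probability vector, θ denotes the trainable weights of the current stage and η is the learning rate. Averaging over the batch is also an assumption.

```latex
L(\theta) = -\frac{1}{|B|}\sum_{n \in B}\sum_{k=1}^{3} y_{n,k}\,\log \hat{p}_{n,k}(\theta),
\qquad
\theta \leftarrow \theta - \eta\,\nabla_{\theta} L(\theta),
\qquad \eta = 10^{-4},\; |B| = 20
```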
Fig. 8 Training sample
In order to deploy the deep learning model on the low-power mobile platform, we employ efficient compression methods to decrease the number of parameters and reduce the memory footprint. Following [12,13], we use the entropy-based channel selection metric proposed there to evaluate the importance of each filter and prune several weak filters. As a result, the pruned model gains about 3× to 4× speed-up in both the training and inference stages with a compression rate of 50%.
We pick 16 feature response results at random, as shown in Fig. 9. By comparison, we find that in the shallow layers the feature responses of pedestrians and vehicles to the same convolution kernel are almost the same. These convolution kernels reflect basic structural information that is shared by pedestrians and vehicles. In the deep layers, however, the responses to the same convolution kernel differ. These features do not come directly from the image but from the previous layer. They can be regarded as a concentration and generalization of the shallow features, so the feature information from the deep layers is more abstract and flexible, and it reflects the comprehensive features of the object more robustly and accurately.
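One common way to score channels by entropy is sketched below. It is an assumption that this matches the metric of [12,13] in detail: each channel's activation map is averaged per image, the averages over a set of images are binned into a histogram, and the entropy of that histogram is the score. Channels whose response barely varies across images get a low score and are pruned first. The function names, the bin count and the keep ratio are hypothetical.

```python
import numpy as np

def channel_entropy(activations, bins=32):
    """Entropy score per channel.

    activations: array of shape (n_images, n_channels, height, width)
    holding one layer's feature maps for a set of sample images.
    """
    pooled = activations.mean(axis=(2, 3))          # (n_images, n_channels)
    scores = np.empty(pooled.shape[1])
    for c in range(pooled.shape[1]):
        counts, _ = np.histogram(pooled[:, c], bins=bins)
        p = counts[counts > 0] / counts.sum()       # empirical distribution
        scores[c] = -(p * np.log(p)).sum()
    return scores

def select_channels(activations, keep_ratio=0.5, bins=32):
    """Indices of the channels to keep, i.e. those with the highest entropy."""
    scores = channel_entropy(activations, bins)
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    keep = np.argsort(scores)[::-1][:n_keep]
    return np.sort(keep)

# Toy example: channel 0 always responds identically, channel 1 varies per image.
acts = np.zeros((16, 2, 4, 4))
acts[:, 1] = np.arange(16)[:, None, None]
scores = channel_entropy(acts)
kept = select_channels(acts, keep_ratio=0.5)
```

With a keep ratio of 0.5 the constant channel is removed and the informative one is retained, which mirrors the 50% compression rate mentioned above on a toy scale.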
Fig. 9 Comparison of feature response
We employ distributed computing and follow the robust software partitioning technique [35]. In the proposed system, the GPU runs the multi-scale fast stereo matching algorithm and the deep learning model using CUDA programming [36], while the CPUs execute instruction scheduling and alternative region detection using parallel computing [22]. A main CPU assigns tasks to the other CPUs and the GPU through instruction scheduling, as shown in Fig. 10.
Fig. 10 Distributed computation
The data flow is transferred between memory and processors (CPUs and GPU). The memory is responsible for storing processing data and final results, while processors execute functional modules. The different lines mean the data flow on bus, and arrows indicate the direction of propagation.
The technique of distributed computation is detailed as follows.
Step 1 Left and right images are sent from camera and stored in shared memory. The CPU instructs the GPU to fetch images to calculate the stereo matching [37] and to push the disparity map into shared memory.
Step 2 The CPU undertakes the road detection and the alternative region extraction based on the disparity map and sends results into shared memory.
Step 3 The alternative region is assigned to the GPU, where a deep learning model is carried out to extract features. These features are also stored in shared memory.
Step 4 According to features, the CPU classifies objects in the alternative region and marks the location in the original left image [38,39].
The data flow is shown in Fig. 11.
Fig. 11 Data flow
Shared memory is managed by the scheduling system, which supports simultaneous access by multiple processes. Generally, threads access a buffer by its memory address. When several threads access the same buffer at once, a system lock serializes the access at the cost of a certain amount of time. For a 640 pixel × 320 pixel image, one memory copy takes less than 0.3 ms, as determined by testing. The maximum number of memory copies in each independent processing pass, from the image input to the final obstacle detection output, is 2×7+1×2=16, so memory copying costs less than about 4.8 ms per pass.
In this section, we evaluate the proposed system by experiments and comment on the results. Firstly, we evaluate each module with different experiments. Then we compare the proposed system’s result with other methods in obstacle perception. Finally, we verify the performance of the proposed system.
At first, we show the result of the multi-scale fast MPV algorithm. We render the disparity map in pseudo color, where cold colors represent low disparity values (far) and warm colors represent high disparity values (near). We install the proposed system on a vehicle and test the improved stereo matching algorithm in practice. Fig. 12 shows eight example results of the multi-scale fast MPV algorithm. In each example, the left and middle images are the single-channel images from the binocular camera and the right image is the pseudo-color disparity map. The results show that the improved stereo matching algorithm is robust and accurate.
Fig. 12 Stereo matching
Then, consecutive frames of disparity maps are used to fit the road model in the v-disparity space, as shown in Fig. 13. The red line represents the road model, described by the disparity value and the image row index following [40]. Then the alternative regions are segmented using both the road model and the disparity map.
Fig. 13 Road model
Next, the alternative regions are extracted from the left image and normalized to 224×224×3. Finally, these normalized images are fed into the deep learning model as input data. The segmented alternative regions are shown in Fig. 14, where the red boxes are the minimum bounding rectangles of the segmented regions and are called alternative regions.
Fig. 14 Alternative region
The above alternative regions are arranged according to the ascending order of average disparity. According to the order, they are sent to the deep learning model to extract the features. We pay more attention to obstacles with large disparity, because they are closer to us.
Generally, shallow convolution layers tend to extract more basic structural features of the image, such as edge and color. As shown in Fig. 15, we show the response result of all 64 convolution kernels of the first convolution layer, in which the highlight area indicates strong response and the dark area indicates weak response. We find that most of the strong response occurs at the edge of the object or in the pure color part.
Fig. 15 Response results in the first convolution layer (64 feature maps)
In contrast, the features extracted in the deep convolution layers are more abstract. Fig. 16 shows the last convolution layer before the fully-connected layers. It contains the responses of 512 convolution kernels, but their meaning cannot be described intuitively.
Fig. 16 Response results in the last convolution layer (512 feature maps)
Finally, the feature detector outputs a 1×3 probability vector. We use boxes of different colors to distinguish the classification results: red for vehicles, yellow for pedestrians, green for bikes and blue for others, as shown in Fig. 17.
The test set consists of 300 images, each containing at least one and at most ten obstacles. There are 387 vehicles, 130 pedestrians, 182 bikes and 288 other obstacles in the test set. The accuracy rate of obstacle recognition is about 89.81%. We also report the precision and recall of each category in Table 2, where the rows are the ground truth (Label), the columns are the observed values (Result), and each entry of the matrix is the number of obstacles with ground truth Label that are observed as Result. The precision and recall are 92.86% and 90.70% for vehicles, 82.44% and 83.08% for pedestrians, 90.59% and 84.64% for bikes, and 82.14% and 87.85% for others.
Table 2 Result of deep learning model
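Precision, recall and accuracy follow from such a confusion matrix as sketched below, with rows as ground truth and columns as results. The 2×2 matrix is a made-up toy example and does not reproduce the entries of Table 2; the function name is hypothetical.

```python
import numpy as np

def precision_recall(confusion):
    """Per-class precision and recall plus overall accuracy.

    confusion[a, b] = number of obstacles with ground truth a reported as b.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    recall = true_pos / confusion.sum(axis=1)      # divide by ground-truth totals
    precision = true_pos / confusion.sum(axis=0)   # divide by predicted totals
    accuracy = true_pos.sum() / confusion.sum()
    return precision, recall, accuracy

# Illustrative counts only (not taken from the experiments)
toy = [[8, 2],
       [1, 9]]
precision, recall, accuracy = precision_recall(toy)
```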
We compare the perception result of the proposed system with advanced deep learning networks, namely SSD, YOLO and faster R-CNN, in the complex environment. We select SSD and faster R-CNN models based on the VGG network and a YOLO v3 model as the control group. The test set in Subsection 4.3 is used again.
To facilitate the comparison, we revise the final output of these deep learning networks. For example, we choose the faster R-CNN model that has been pre-trained on the ImageNet dataset. The deep learning model provides two results for each obstacle: the proposal and the category. We take the proposal as the final region and pass the top 10 categories to a special strategy for further processing. If vehicle-related categories, such as school bus, sport car, fire truck, dustcart and pickup, appear in the top 10, their confidence values are summed and normalized over the top 10. The result is set as the Vehicle entry of the 1×3 probability vector mentioned in Subsection 4.3. Pedestrians and Bikes are handled in the same way.
With this strategy, the faster R-CNN model becomes suitable for the comparative experiment. Its results are then analyzed with the classification strategy in Subsection 3.3. The SSD and YOLO outputs are revised in the same way.
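The grouping of the top-10 categories into a 1×3 vector can be sketched as follows. The label names, the group membership sets and the function name are hypothetical, and "normalized over the top 10" is read here as dividing by the sum of the ten confidence values.

```python
def group_probabilities(top10, groups):
    """Collapse fine-grained top-k labels into coarse group probabilities.

    top10: list of (label, confidence) pairs from the detector.
    groups: dict mapping a coarse class name to a set of fine-grained labels.
    Returns one value per coarse class, in the order of `groups`.
    """
    total = sum(conf for _, conf in top10)
    sums = {name: 0.0 for name in groups}
    for label, conf in top10:
        for name, members in groups.items():
            if label in members:
                sums[name] += conf
                break
    return [sums[name] / total if total > 0 else 0.0 for name in groups]

# Hypothetical label sets and detector output, for illustration only
groups = {
    "vehicle": {"school bus", "sport car", "fire truck", "pickup"},
    "pedestrian": {"person"},
    "bike": {"bicycle", "motorcycle"},
}
top10 = [("school bus", 0.4), ("sport car", 0.2), ("bicycle", 0.1), ("tree", 0.3)]
vector = group_probabilities(top10, groups)
```

Confidence assigned to labels outside the three groups, such as the tree in this example, is left unassigned, so the vector need not sum to one before the ratio test is applied.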
As shown in Table 3, the detection results for standard obstacles, namely Vehicles, Pedestrians and Bikes, do not differ significantly. The end-to-end deep learning networks have a slight advantage in precision, while the proposed system has the advantage in recall. The significant gap is mainly in Others. Within Others, the end-to-end deep learning networks focus on common objects such as trees, traffic signs and trash cans, and are insensitive to non-standard obstacles such as barricades, barriers and walls. This is because the pre-trained end-to-end deep learning networks are sensitive to categories that exist in ImageNet, so these models fail on obstacles that do not appear in the training sets. In complex environments, a low recall for non-standard obstacles is unsafe. The proposed system effectively overcomes this shortcoming, as it depends on the disparity map rather than the region proposal network (RPN) to provide the proposals.
Besides, we also group the standard obstacles by distance: Vehicles are divided into two control groups, near and far, with 50 m as the boundary, and Pedestrians and Bikes with 30 m as the boundary. As shown in Table 4, the detection results are similar at near range. At far range, the proposed system is better than the end-to-end deep learning networks. As distant obstacles occupy only a small region in the test images, the RPN is not sensitive to them, whereas the proposed system remains effective in that case.
Table 3 End-to-end deep learning networks for obstacle detection %
Table 4 Detection result in different distances %
In summary, the detection result of the proposed system is close to that of the end-to-end deep learning networks for standard obstacles. For non-standard obstacles and at far range, the proposed system shows clear advantages, which is a valuable feature for intelligent driving in complex environments.
The proposed system consists of two main parts: stereo matching and obstacle detection. We pay attention to the operation efficiency and system performance. Therefore, three experiments are designed to evaluate the efficiency of the modules and the perception performance of the system. At first, we compare the runtime of the original algorithms in the references with that of the algorithms of the proposed system.
As shown in Table 5, the improved algorithms are faster than the original algorithms. For the stereo matching algorithm, the proposed multi-scale Viterbi algorithm gains more than 2.5× speed-up, as expected. The obstacle perception module is accelerated about three times. In addition, we repeat the above experiment at different GPU frequencies. The results show that the proposed system is faster in each case, and it runs at about 4.5 fps at a GPU frequency of 852 MHz.
Table 5 Runtime of modules
Besides, the image appearance changes with weather and light conditions, and system performance under these conditions is an important concern for automatic driving. We collect six test sets in different field environments: sunny, light rain, heavy rain, snowy, night and backlight. As shown in Table 6, the proposed system achieves good detection results. In bad weather and light conditions, such as heavy rain and backlight, the disparity map of the system deteriorates, so the final perception rate is reduced.
Table 6 Weather and light test %
Furthermore, the effect of ambient temperature is shown in Table 7. The proposed system can work at ambient temperatures of 20 ℃ to 60 ℃, and its full-load power is no more than 10 W. To protect the chip, a power-off threshold of 95 ℃ is set.
The full-load power of the proposed system is about 10 W, which shows that this environment perception system based on binocular vision is suitable for the low-power mobile platform. The proposed system shows that, with existing mature technology, it can preliminarily meet the commercial needs of intelligent driving or assisted driving.
Table 7 Effect of ambient temperature
A robust perception system is proposed in this paper, which can be applied to intelligent vehicles or advanced driver assistance systems. There are four major innovations in this paper. Firstly, we deploy the proposed system on a low-power mobile platform and present a complete system design. Then, a multi-scale fast stereo matching algorithm is proposed to provide dense disparity information that is used to detect the alternative regions of obstacles. Next, a compressed deep learning network is employed as the feature detector. Finally, a fusion strategy combines the stereo vision technique and the deep learning network in order to improve the perception result in complex environments. Therefore, the proposed system is suitable for deployment on the low-power mobile platform. Our future work will focus on the compression and the recognition improvement of the deep learning model.
Journal of Systems Engineering and Electronics, 2021, Issue 1