ERK'2019, Portorož

Comparison of Semi-Global Block Matching Algorithm and DispNet Neural Network on KITTI Data Set

Tin Kramberger 1, Božidar Potočnik 2
1 Tehničko veleučilište u Zagrebu, Vrbik 8, Zagreb, Croatia
2 Fakulteta za elektrotehniko, računalništvo in informatiko, Koroška cesta 46, 2000 Maribor
E-mail: tin@tvz.hr, bozidar.potocnik@um.si

Abstract

Disparity estimation is a challenging task with numerous real-world applications. There are two main approaches to this problem: the classic Semi-global block matching algorithm and a newer approach that uses a trained convolutional neural network to estimate the disparity. This paper illustrates the advancement of disparity estimation in terms of accuracy. The accuracy of disparity estimation using the Semi-global block matching algorithm (SGM) and the trained DispNet convolutional neural network is assessed and compared. Results on the KITTI test data set show better performance of the DispNet neural network in terms of accuracy compared to SGM. It can be concluded that neural networks are taking primacy in disparity estimation.

1 Introduction

Stereo reconstruction is one of the most significant research topics in the area of computer vision. To reconstruct a scene from a stereo camera, one first needs to create a disparity map. Disparity estimation is a technique that converts a pair (or more) of images into a depth map, in order to estimate distances from a stereo camera. It is used in 3D reconstruction, depth prediction, autonomous driving, scene understanding, etc. In the past, when hardware was somewhat lacking in computational power, different mathematical approaches were used. They included three steps: feature extraction, matching cost aggregation, and disparity computation [1], [2]. The Semi-global block matching algorithm (SGM) [1] is often ambiguous and produces wrong matches due to reflections, noise, occlusions, etc. Nowadays, as computer hardware is rapidly evolving, hardware is becoming less and less of an obstacle for researchers. Accordingly, the use of deep convolutional neural networks has become much more convenient; training the neural network is computationally still the most expensive part of the process [3]. Deep neural networks are often used to generate disparity maps together with additional components that do not belong to the domain of neural networks: classical processing components are frequently used to improve the performance of a neural network for the specific task at hand, e.g. GA-Net [4]. There are also deep convolutional neural networks that use no domain-specific layers created specifically for the disparity problem, such as DispNet [5], [6].

This paper compares the classical Semi-global block matching algorithm (SGM) and the DispNet neural network on disparity map estimation. The accuracy of the disparity map obtained by each method is assessed and compared. The KITTI data set [6], [7], [8] was chosen for benchmarking SGM and DispNet due to its recognition in the research community and its solid development kit.

2 Semi-Global matching method

The Semi-Global matching method (SGM) is based on pixelwise matching of Mutual Information and approximation of a global, 2D smoothness constraint by combining many 1D constraints. Input images must have a known epipolar geometry, but they are not required to be rectified [1]. The SGM algorithm determines the disparity in three steps: it first computes the matching cost, then the aggregated cost, and, finally, the disparity.

The matching cost is computed for a base image pixel p from its intensity I_{bp} and the presumed correspondence I_{mq} with pixel q = e_{bm}(p, d) in the match image. The function e_{bm}(p, d) symbolizes the epipolar line in the match image, where p is a pixel of the base image and d is the disparity. The pixelwise cost can be computed in two different ways: using the Birchfield and Tomasi method or the Mutual Information (MI) method [9], [10]. Birchfield and Tomasi compute the cost C_{BT}(p, d) as the minimum absolute intensity difference between pixels p and q = e_{bm}(p, d) within a range of half a pixel in each direction along the epipolar line.

The MI cost is insensitive to illumination changes and recording differences. It is defined from the entropies of the two images and their joint entropy (1). Equation (1) operates on full images and, theoretically, requires the disparity image [1].

    MI_{I_1,I_2} = H_{I_1} + H_{I_2} - H_{I_1,I_2}    (1)

The entropies can be calculated from the probability distributions P of the colour intensities i of the associated images, as Equations (2) and (3) show. Intensity refers to the amount of light, i.e. the numerical value of a pixel.

    H_I = -\int_0^1 P_I(i) \log P_I(i) \, di    (2)

    H_{I_1,I_2} = -\int_0^1 \int_0^1 P_{I_1,I_2}(i_1, i_2) \log P_{I_1,I_2}(i_1, i_2) \, di_1 \, di_2    (3)

For well-registered images of the same scene, the joint entropy H_{I_1,I_2} in Equation (3) is low, because one image can be predicted from the other, resulting in low information, which in turn increases the Mutual Information (MI). In Equations (3) and (4), P_{I_1,I_2} is the joint probability distribution of corresponding intensities. In Equations (4) and (7), the operator T[·] returns 1 if its argument is true, and 0 otherwise.

    P_{I_1,I_2}(i, k) = \frac{1}{n} \sum_p T[(i, k) = (I_{1p}, I_{2p})]    (4)

Kim et al. [11] used a Taylor expansion to transform the calculation of the joint entropy H_{I_1,I_2} into a sum over pixels (see Equation (5)). The entropy calculation is implemented using convolution in Equation (6), where h_{I_1,I_2} is calculated from the joint probability distribution of Equation (4). The number of corresponding pixels is n, and convolution with a 2D Gaussian is denoted by \otimes g(i, k).

    H_{I_1,I_2} = \sum_p h_{I_1,I_2}(I_{1p}, I_{2p})    (5)

    h_{I_1,I_2}(i, k) = -\frac{1}{n} \log(P_{I_1,I_2}(i, k) \otimes g(i, k)) \otimes g(i, k)    (6)
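To make Equations (4)-(6) concrete, the following is a minimal NumPy/SciPy sketch of the MI data-cost table. It is not the authors' implementation; the function name mi_cost_table, the Gaussian width sigma, and the eps guard against log(0) are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mi_cost_table(i1, i2, sigma=1.0, eps=1e-7):
    """Data-cost lookup table h_{I1,I2}(i, k) from Equations (4)-(6).

    i1, i2: uint8 intensity images of the same scene (in SGM, i2 would
    be the match image warped by the current disparity estimate).
    Returns a 256x256 table indexed by intensity pairs.
    """
    n = i1.size
    # Equation (4): joint probability distribution of intensity pairs
    p, _, _ = np.histogram2d(i1.ravel(), i2.ravel(),
                             bins=256, range=[[0, 256], [0, 256]])
    p /= n
    # Equation (6): Gaussian-smoothed negative log of the smoothed
    # joint distribution, scaled by 1/n (Taylor approximation of MI)
    return gaussian_filter(-np.log(gaussian_filter(p, sigma) + eps),
                           sigma) / n

# Equation (5): the joint entropy is the sum of table entries over all
# corresponding pixel pairs, e.g.:
#   H = mi_cost_table(i1, i2)[i1.ravel(), i2.ravel()].sum()
```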
After the matching cost, the aggregated cost must be computed. In contrast to the classic block matching (BM) algorithm, SGM adds to the pixel-wise cost a smoothness constraint that penalizes disparity changes at neighbouring pixels. The pixel-wise cost and the smoothness constraint are combined in the energy of Equation (7). The first sum in this equation runs over the matching costs of all pixels, the second sum adds a constant penalty P_1 for all pixels q in the neighbourhood N_p of p for which the disparity change is small, and the third sum adds a larger constant penalty P_2 for larger disparity changes [1].

    E(D) = \sum_p \left( C(p, D_p) + \sum_{q \in N_p} P_1 T[|D_p - D_q| = 1] + \sum_{q \in N_p} P_2 T[|D_p - D_q| > 1] \right)    (7)

The disparity map is then determined as in local stereo methods: for each pixel p, the disparity d with the minimum aggregated cost is selected. The process is shown in Figure 1.

Figure 1: Flowchart of SGM [1].
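Because a global 2D minimization of Equation (7) is intractable, SGM approximates it by aggregating the matching cost along several 1D paths and summing the results, as stated at the beginning of this section. The sketch below shows the recurrence for a single left-to-right path; the function name aggregate_left_to_right and the default penalties P1 = 8 and P2 = 32 are illustrative assumptions, not the paper's code.

```python
import numpy as np

def aggregate_left_to_right(cost, P1=8, P2=32):
    """1D cost aggregation along one path (left-to-right).

    cost: float array of shape (H, W, D) with the pixel-wise matching
    cost for D disparity hypotheses. Full SGM sums 8 or 16 such path
    directions before the winner-takes-all step.
    """
    H, W, D = cost.shape
    L = np.empty_like(cost)
    L[:, 0] = cost[:, 0]                      # path starts at the image border
    for x in range(1, W):
        prev = L[:, x - 1]                    # (H, D) costs at the previous pixel
        best_prev = prev.min(axis=1, keepdims=True)
        # candidate transitions: same d; d +/- 1 with penalty P1; any d with P2
        same = prev
        up = np.pad(prev[:, 1:], ((0, 0), (0, 1)),
                    constant_values=np.inf) + P1
        down = np.pad(prev[:, :-1], ((0, 0), (1, 0)),
                      constant_values=np.inf) + P1
        jump = best_prev + P2
        # subtracting best_prev keeps the accumulated values bounded
        L[:, x] = cost[:, x] + np.minimum(np.minimum(same, up),
                                          np.minimum(down, jump)) - best_prev
    return L

# Winner-takes-all disparity after summing all path directions:
#   disparity = sum_of_paths.argmin(axis=2)
```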
3 DispNet neural network

DispNet was created by Mayer et al. and is a classic convolutional neural network without any additional task-specific filters in its architecture [6]. For this research, a pretrained DispNet neural network was used. It was trained and fine-tuned on 400 stereo pairs with ground truth disparities from the KITTI 2012 and 2015 training data sets [7], [12]. The network was trained using the Adam optimizer with β_1 = 0.9 and β_2 = 0.999 and a learning rate λ = 1e-4, which was divided by 2 every 200,000 iterations starting from iteration 400,000 [13]. Mayer et al. claim that a loss weight schedule is beneficial, and started training with a loss weight of 1 assigned to the lowest resolution (loss6) and weight 0 for the other losses. During training, the weights of the higher-resolution losses were increased, and the weights of the lower resolutions were deactivated. Spatial and chromatic transformations were performed for data augmentation on the flow data set; for the disparity data set, only chromatic transformations were performed, since spatial transformations could break the epipolar constraint [6].

Table 1: Specification of the DispNet architecture.

    Name        Kernel  Str.  Input
    conv1       7x7     2     Images
    conv2       5x5     2     conv1
    conv3a      5x5     2     conv2
    conv3b      3x3     1     conv3a
    conv4a      3x3     2     conv3b
    conv4b      3x3     1     conv4a
    conv5a      3x3     2     conv4b
    conv5b      3x3     1     conv5a
    conv6a      3x3     2     conv5b
    conv6b      3x3     1     conv6a
    pr6+loss6   3x3     1     conv6b
    upconv5     4x4     2     conv6b
    iconv5      3x3     1     upconv5+pr6+conv5b
    pr5+loss5   3x3     1     iconv5
    upconv4     4x4     2     iconv5
    iconv4      3x3     1     upconv4+pr5+conv4b
    pr4+loss4   3x3     1     iconv4
    upconv3     4x4     2     iconv4
    iconv3      3x3     1     upconv3+pr4+conv3b
    pr3+loss3   3x3     1     iconv3
    upconv2     4x4     2     iconv3
    iconv2      3x3     1     upconv2+pr3+conv2
    pr2+loss2   3x3     1     iconv2
    upconv1     4x4     2     iconv2
    iconv1      3x3     1     upconv1+pr2+conv1
    pr1+loss1   3x3     1     iconv1

Table 1 presents the DispNet architecture. The DispNet convolutional neural network was chosen for this work because it does not contain additional layers that were developed specifically for stereo vision problems and could improve the network's performance [4].

4 KITTI Stereo Evaluation 2015 data set

This data set consists of 200 training scenes and 200 test scenes with 4 colour images per scene, saved in a lossless PNG format. The ground truth was established with a semi-automatic process. The evaluation framework, written in MATLAB, comes with a development kit that computes the percentage of faulty pixels averaged over the ground truth pixels. A pixel is correctly estimated if its disparity or flow end-point error is below 3 pixels or 5% of its true value [7]; a sketch of this criterion is given after Figure 2. Figure 2 depicts sample stereo images with the ground truth disparity from this data set.

Figure 2: KITTI data set stereo images and LiDAR-created disparity map.
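The 3-pixel / 5% criterion can be written down compactly. The following is a sketch of the rule as described above, not the official MATLAB development kit; the function name kitti_disparity_error is hypothetical, and it assumes disparity maps given as float arrays with 0 marking pixels without ground truth, as in KITTI's encoding.

```python
import numpy as np

def kitti_disparity_error(d_est, d_gt, tau_px=3.0, tau_rel=0.05):
    """Percentage of erroneous pixels under the KITTI 2015 criterion.

    A pixel counts as correct if its end-point error is below 3 px
    or below 5% of the true disparity. Pixels without ground truth
    (d_gt == 0) are excluded from the average.
    """
    valid = d_gt > 0
    err = np.abs(d_est[valid] - d_gt[valid])
    bad = (err >= tau_px) & (err >= tau_rel * d_gt[valid])
    return 100.0 * bad.mean()
```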
5 Testing methodology

In general, the performance of an algorithm can be assessed with respect to execution time and accuracy. In this paper, however, both algorithms were compared only with respect to disparity map estimation accuracy. The performance was evaluated using a custom MATLAB script that uses the KITTI evaluation framework as its base. SGM was implemented using the OpenCV StereoSGBM method. This method allows the block size to be changed; it should be stressed that with a block size of 1 it behaves exactly like the classic SGM algorithm described in Section 2 [14]. The remaining parameters of the SGM algorithm are gathered in Table 2 (a configuration sketch follows at the end of this section).

Table 2: SGM algorithm parameters.

    Parameter            Value
    pre filter cap       63
    sad window size      3
    p1                   3*3*4
    p2                   3*3*32
    min disparity        0
    num disparities      128
    uniqueness ratio     10
    speckle window size  100
    speckle range        32
    disp max diff        1
    full dp              1

The DispNet convolutional neural network was used with the fine-tuned weights that are available in the researchers' GitHub repository [15], [16]. Testing was conducted on the first 100 images of the KITTI data set. Disparity estimation errors were calculated for each test image using the ground truth and the protocol from the KITTI data set. Finally, the obtained errors were averaged over the entire data set.
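For reference, the parameters of Table 2 map directly onto OpenCV's StereoSGBM interface. The Python sketch below shows one plausible configuration; the paper's own evaluation used MATLAB, so this is an illustration rather than the authors' script, and the assumption is made that "full dp" corresponds to the two-pass MODE_HH setting.

```python
import cv2

# Configure OpenCV StereoSGBM with the parameters from Table 2.
block = 3                              # "sad window size" from Table 2
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,
    blockSize=block,
    P1=block * block * 4,              # small-change penalty p1
    P2=block * block * 32,             # large-change penalty p2
    disp12MaxDiff=1,
    preFilterCap=63,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
    mode=cv2.StereoSGBM_MODE_HH,       # assumed equivalent of "full dp"
)

# left, right: rectified grayscale images. OpenCV returns fixed-point
# disparities with 4 fractional bits, hence the division by 16:
#   disparity = sgbm.compute(left, right).astype("float32") / 16.0
```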
6 Results

The results are presented as a cumulative error for each image and as the average error over the entire KITTI testing data set. Figure 3 depicts qualitative results for a sample image, namely the estimated disparity, the ground truth disparity, and the calculated error, for the SGM method (left column) and DispNet (right column). It can be observed that the ground truth images are displayed in different colours; the reason is the difference in the maximum disparity that each disparity estimation method can compute. Figure 4 presents the results as cumulative error graphs for each test image from the KITTI data set and for both disparity estimation methods. Values are in percentages; for example, a value of 35% means that the disparity was determined wrongly (i.e., the calculated disparity of a pixel differs from the ground truth by more than 3 pixels) for 35% of the pixels in the test image. The better performance of DispNet compared to the SGM algorithm is clearly visible. The obtained results are in line with those reported on the benchmark pages of the KITTI data set [17]. The average error on the complete KITTI test data set is 21.5% (σ = 7.7) for the SGM algorithm and 6.1% (σ = 3.4) for DispNet.

7 Conclusion

Within this paper, a simple experiment comparing the SGM and DispNet methods demonstrated that neural networks possess exceptional strength for disparity map estimation, even without additional image processing and transformations. The obtained results show that neural networks have progressed over the last few years and have gained an advantage over classic methods for disparity estimation. The KITTI data set was an excellent choice for benchmarking, because it consists of real images with added ground truth. The results on the KITTI data set show that the DispNet neural network estimates the disparity correctly for, on average, about 15% more image pixels than the classic SGM method. In addition, its standard deviation is lower as well, which means that disparity estimation using DispNet is more consistent.

Figure 3: Qualitative results for a sample testing stereo image. Results for SGM are in the left column, those for DispNet in the right column. The topmost row depicts the estimated disparity maps, the middle row the ground truth, and the bottommost row the computed error between estimation and ground truth.

Figure 4: Cumulative errors (in percent) for each testing image from the KITTI data set for both disparity estimation methods (SGM and DispNet).

References

[1] H. Hirschmuller, "Stereo Processing by Semiglobal Matching and Mutual Information," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 328-341, Feb. 2008.
[2] D. Scharstein, R. Szeliski, and R. Zabih, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," in Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001), pp. 131-140.
[3] C. C. Aggarwal, Neural Networks and Deep Learning. Springer, 2018.
[4] F. Zhang, V. Prisacariu, R. Yang, and P. H. S. Torr, "GA-Net: Guided Aggregation Net for End-to-end Stereo Matching," CoRR, vol. abs/1904.06587, 2019.
[5] J. Žbontar and Y. LeCun, "Computing the stereo matching cost with a convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1592-1599.
[6] N. Mayer et al., "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040-4048.
[7] M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3061-3070.
[8] M. Menze, C. Heipke, and A. Geiger, "Object Scene Flow," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 140, pp. 60-76, 2018.
[9] P. Viola and W. M. Wells, "Alignment by Maximization of Mutual Information," Int. J. Comput. Vis., vol. 24, no. 2, pp. 137-154, 1997.
[10] C. Tomasi and S. Birchfield, "Depth discontinuities by pixel-to-pixel stereo," Int. J. Comput. Vis., vol. 35, no. 3, pp. 269-293, 1998.
[11] J. Kim, V. Kolmogorov, and R. Zabih, "Visual Correspondence Using Energy Minimization and Mutual Information," in Proc. Ninth IEEE Int. Conf. Comput. Vis., vol. 2, pp. 1033-, 2003.
[12] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354-3361.
[13] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," Proc. 3rd Int. Conf. Learn. Represent., pp. 1-15, Dec. 2014.
[14] "OpenCV documentation: StereoSGBM Class Reference." [Online]. Available: https://docs.opencv.org/3.4.1/d2/d85/classcv_1_1StereoSGBM.html. [Accessed: 11-Jun-2019].
[15] A. Tonioni, "Real-time self-adaptive deep stereo GitHub." [Online]. Available: https://github.com/CVLAB-Unibo/Real-time-self-adaptive-deep-stereo.
[16] A. Tonioni, F. Tosi, M. Poggi, S. Mattoccia, and L. Di Stefano, "Real-time self-adaptive deep stereo," CoRR, vol. abs/1810.05424, Oct. 2018.
[17] "KITTI Stereo Evaluation 2015." [Online]. Available: http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo. [Accessed: 13-Jun-2019].