https://doi.org/10.31449/inf.v44i4.3018 Informatica 44 (2020) 437–445

A Semi-Supervised Approach to Monocular Depth Estimation, Depth Refinement, and Semantic Segmentation of Driving Scenes using a Siamese Triple Decoder Architecture

John Paul T. Yusiong (1,2) and Prospero C. Naval, Jr. (1)
1 Computer Vision and Machine Intelligence Group, Department of Computer Science, College of Engineering, University of the Philippines, Diliman, Quezon City, Philippines
2 Division of Natural Sciences and Mathematics, University of the Philippines Visayas Tacloban College, Tacloban City, Leyte, Philippines
E-mail: jtyusiong@up.edu.ph; pcnaval@up.edu.ph

Keywords: Siamese triple decoder architecture, depth estimation and refinement, semantic segmentation, semi-supervised learning methods

Received: November 30, 2019

Depth estimation and semantic segmentation are two fundamental tasks in scene understanding. These two tasks are usually solved separately, although they have complementary properties and are highly correlated. Jointly solving these two tasks is very beneficial for real-world applications that require both geometric and semantic information. Within this context, the paper presents a unified learning framework for generating a refined depth estimation map and a semantic segmentation map given a single image. Specifically, this paper proposes a novel architecture called JDSNet. JDSNet is a Siamese triple decoder architecture that can simultaneously perform depth estimation, depth refinement, and semantic labeling of a scene from an image by exploiting the interaction between depth and semantic information. A semi-supervised method is used to train JDSNet to learn features for both tasks, where geometry-based image reconstruction methods are employed instead of ground-truth depth labels for the depth estimation task, while ground-truth semantic labels are required for the semantic segmentation task. This work uses the KITTI driving dataset to evaluate the effectiveness of the proposed approach. The experimental results show that the proposed approach achieves excellent performance on both tasks, indicating that the model can effectively utilize both geometric and semantic information.

Povzetek: The article presents an original semi-supervised learning method for diverse visual tasks.

1 Introduction

Scene understanding is crucial for autonomous driving systems since it provides a mechanism to understand the scene layout of the environment [1, 2]. Scene understanding involves depth estimation and semantic segmentation, which facilitate the understanding of the geometric and semantic properties of a scene, respectively. Depth estimation and semantic segmentation address different aspects of scene understanding but have complementary properties and are highly correlated. For semantic segmentation, depth values help improve semantic understanding by enabling the model to generate more accurate object boundaries or differentiate objects having a similar appearance, since these values encode structural information of the scene. On the other hand, for depth estimation, the semantic labels provide valuable prior knowledge to depict the geometric relationships between pixels of different classes and generate a better scene layout [3, 4, 5, 6].
Thus, these two fundamental tasks in computer vision can be dealt with in an integrated manner under a unified framework that optimizes multiple objectives to improve computational efficiency and performance for both tasks from single RGB images. However, addressing depth estimation and semantic segmentation simultaneously, such that the two tasks benefit from each other, is non-trivial and is one of the most challenging problems in computer vision, given the peculiarities of each task and the limited information that can be obtained from monocular images.

Previous works jointly model these two tasks using traditional hand-crafted features and RGB-D images [7, 8]. However, the hand-crafted feature extraction process is quite tedious and generally fails to achieve high accuracy, while RGB-D image acquisition is a costly endeavor. To overcome these issues, researchers employ a unified framework based on deep learning that enables the two tasks to enhance each other using single RGB images only, and this approach has led to significant breakthroughs for both tasks [4, 5, 6, 9, 10, 11, 12]. Since these unified frameworks are based on the fully-supervised learning method, they require vast quantities of training images with per-pixel ground-truth semantic labels and depth measurements, and obtaining these ground truths is non-trivial, costly, and labor-intensive. An alternative approach, as proposed by Ramirez et al. [13], is to integrate depth estimation and semantic segmentation into a unified framework using the semi-supervised learning method. The semi-supervised learning framework requires ground-truth semantic labels to provide supervisory signals for the semantic segmentation task, while for the depth estimation task, it employs geometry-based image reconstruction methods that utilize secondary information based on the underlying theory of epipolar constraints instead of requiring ground-truth depth measurements during training. In other words, this framework assumes that both stereo image pairs and semantic annotations are available during training, since it exploits the relationship between the geometric and semantic properties of a scene by performing semantic segmentation in a supervised manner and casting the depth estimation task as an image reconstruction problem in an unsupervised manner.

This paper presents another attempt at addressing the joint inference problem of depth estimation and semantic segmentation from a single image by proposing to train a novel architecture using a unified learning framework based on a semi-supervised technique. This paper introduces a novel Siamese triple decoder architecture with a disparity refinement module and a segmentation fusion module. The triple decoder architecture consists of one shared encoder and three parallel decoders. The disparity refinement module handles visual artifacts and blurred boundaries to generate better depth maps with no border artifacts around the image boundary, while the segmentation fusion module generates the semantic segmentation map. In contrast, previous works apply a non-trainable post-processing heuristic during testing to refine the depth estimation outputs of the trained model [13, 14].
Essentially, the proposed method enables the model to simultaneously perform depth estimation, depth refinement, and semantic labeling of a scene from an image by exploiting the interaction between depth and semantic information in an end-to-end manner.

The main contributions of this work are the following:

1. It introduces a novel Siamese triple decoder architecture with a disparity refinement module and a segmentation fusion module, referred to as JDSNet, for depth estimation, depth refinement, and semantic segmentation.

2. It presents a unified framework for joint depth estimation with depth refinement and semantic segmentation from a single image based on a semi-supervised technique and trains JDSNet to simultaneously perform depth estimation, depth refinement, and semantic segmentation in an end-to-end manner using rectified stereo image pairs with ground-truth semantic labels as training data.

3. It describes a training loss function that optimizes these two tasks concurrently.

4. It demonstrates that the proposed method is capable of simultaneously addressing these two tasks in a way that is mutually beneficial to both. The experimental results show that jointly solving these two tasks improves the performance of both on various evaluation metrics.

The remainder of the paper is arranged as follows. Section 2 introduces the related works. Section 3 describes the proposed semi-supervised learning framework for simultaneous monocular depth estimation, depth refinement, and semantic segmentation. Section 4 discusses the experimental results using a standard benchmark dataset. Lastly, Section 5 concludes the paper.

2 Related work

This section focuses on previous works that dealt with joint depth estimation and semantic segmentation, where researchers attempted to develop better-suited models using different methods, such as traditional hand-crafted feature extraction techniques and deep learning-based techniques.

The earliest works [7, 8] show the feasibility of jointly modeling depth estimation and semantic segmentation from a single RGB image using the supervised learning method. However, they employ traditional hand-crafted features for these two tasks. The work of Ladicky et al. [7] is considered to be the first to jointly perform monocular depth estimation and semantic segmentation. Using properties of perspective geometry, they proposed an unbiased semantic depth classifier and considered the loss from both semantic and depth labels when training the classifier. They obtained results that outperformed previous state-of-the-art traditional methods in both the monocular depth and semantic segmentation domains. However, their model can only generate coarse depth and semantic segmentation maps because the predictions are based on local regions with hand-crafted features. Similarly, Liu et al. [8] carried out these two tasks in a sequential manner: they first performed semantic segmentation and then used the predicted semantic labels to improve the depth estimation accuracy. Specifically, they used Markov Random Field (MRF) models for depth estimation, where a multi-class image labeling MRF predicts the semantic class for every pixel in the image and the predicted semantic labels serve as priors to estimate depth for each class. By incorporating semantic features, they achieved excellent results with a simpler model that can take into account appearance and geometry constraints.
Other researchers [6, 12, 13] use deep learning techniques for joint monocular depth estimation and semantic segmentation from a single image to improve the performance of each task. The works in [6, 12] performed depth estimation and semantic labeling using the supervised learning method, while Ramirez et al. [13] used the semi-supervised learning method.

Wang et al. [6] and Mousavian et al. [12] used deep network architectures to simultaneously perform depth estimation and semantic segmentation and used a Conditional Random Field (CRF) to combine the depth and semantic information. Specifically, Wang et al. [6] proposed a two-layer Hierarchical Conditional Random Field (HCRF), which employs two convolutional neural networks (CNNs) to extract local and global features that are then enhanced using the CRF. Their proposed approach enabled them to obtain promising results in both the monocular depth and semantic segmentation domains. On the other hand, Mousavian et al. [12] introduced a multi-scale CNN to perform depth estimation and semantic segmentation and combined them using a CRF. As shown in their work, the proposed model achieved comparable results on monocular depth estimation but outperformed the state-of-the-art methods on semantic segmentation. A more recent work by Ramirez et al. [13] proposed to solve the joint inference problem using a semi-supervised learning method, employing a deep network architecture that can be jointly optimized for depth estimation and semantic segmentation: ground-truth semantic labels are required for the semantic segmentation task, while geometry-based image reconstruction methods are employed instead of ground-truth depth labels for the depth estimation task. However, the experimental results reveal that their model, which was jointly trained for depth prediction and semantic segmentation, only improved the depth estimation accuracy; it failed to obtain better results for semantic segmentation.

This work addresses these past design issues to obtain significant improvements when simultaneously performing depth estimation and semantic segmentation using rectified stereo image pairs with ground-truth semantic labels as training data. Specifically, to produce better depth estimates and semantic labeling, the proposed method changes the essential building blocks of the network architecture and introduces a disparity refinement module and a segmentation fusion module to generate better quality depth maps and semantic segmentation maps.

3 Proposed method

This section describes the proposed method for simultaneous depth estimation, depth refinement, and semantic segmentation in a semi-supervised manner using rectified stereo image pairs $(I_L, I_R)$ with ground-truth semantic labels $seg_{gt}$ as training data. Since the training data does not have ground-truth depth labels, the right images $I_R$ together with the predicted disparities $D_{L1}$ are used to obtain supervisory signals for the depth estimation task based on the underlying theory of epipolar constraints during training. In short, the supervisory signal is generated by warping one view of a stereo pair into the other view using the predicted disparity maps.
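To make this image-reconstruction supervision concrete, the following is a minimal PyTorch sketch of the warping step, assuming a (B, 3, H, W) tensor layout and disparities expressed in pixels; the function name and the use of torch.nn.functional.grid_sample are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Reconstruct the left view by sampling the right image at horizontally
    shifted coordinates given the predicted left disparity map.

    img_right: (B, 3, H, W) right stereo image
    disp_left: (B, 1, H, W) predicted left disparity in pixels (assumed)
    """
    b, _, h, w = img_right.shape
    # Build a pixel coordinate grid.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img_right.dtype, device=img_right.device),
        torch.arange(w, dtype=img_right.dtype, device=img_right.device),
        indexing="ij",
    )
    xs = xs.unsqueeze(0).expand(b, -1, -1)
    ys = ys.unsqueeze(0).expand(b, -1, -1)
    # For a rectified pair, left pixel (x, y) corresponds to (x - d, y) in the right image.
    xs_shifted = xs - disp_left.squeeze(1)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * xs_shifted / (w - 1) - 1.0
    grid_y = 2.0 * ys / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    # Bilinear sampling plays the role of the spatial-transformer sampler.
    return F.grid_sample(img_right, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

The photometric difference between the original left image and this reconstruction then provides the training signal formalized in Section 3.2.1.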
Figure 1 presents the semi-supervised framework for joint monocular depth estimation and semantic segmentation using JDSNet. JDSNet is the proposed Siamese triple decoder architecture with a disparity refinement module and a segmentation fusion module.

[Figure 1: A semi-supervised framework for joint monocular depth estimation and semantic segmentation using JDSNet, the proposed Siamese triple decoder architecture. The diagram shows the training phase, in which the left image $I_L$ and its horizontally flipped version $I_{flipL}$ are processed by two weight-sharing triple decoder networks whose outputs ($D_{L1}$, $D_{R1}$, $seg_1$ and $D_{L2}$, $D_{R2}$, $seg_2$) are combined by the disparity refinement module into ($D^{final}_L$, $D^{final}_R$) and by the segmentation fusion module into $seg_{final}$, with image reconstruction from the right image $I_R$ and the ground-truth labels $seg_{gt}$ providing the supervision, and the test phase, in which a single image is used.]

3.1 Network architecture

The semi-supervised framework uses a Siamese architecture with the triple decoder network as the autoencoder. A Siamese architecture consists of two symmetrical structures and accepts two distinct images as inputs. An important feature of a Siamese architecture is that it uses two copies of the same network, and these two networks share weight parameters to process the two different inputs and generate two outputs. The original purpose of using a Siamese architecture is for learning similarity representations, that is, to predict whether the two inputs are similar or not [15, 16]. However, in this study, the two outputs of the Siamese network are combined to produce the refined disparity maps through the disparity refinement module and a segmentation map through the segmentation fusion module.

JDSNet consists of two triple decoder networks that share weight parameters. It also has a disparity refinement module that enables the network to more effectively handle the visual artifacts and blurred boundaries while learning depth estimation. The disparity refinement module is the trainable version of the post-processing heuristic introduced by Godard et al. [14]. This module combines and refines the two pairs of depth maps. The segmentation fusion module combines the outputs from the two semantic segmentation decoders.

The Siamese triple decoder network receives the original left images $I_L$ and the horizontally flipped version of the left input images $I_{flipL}$ as inputs. With these images, the network is trained to predict depth maps, refine the predicted depth maps, and generate semantic segmentation maps. The horizontally flipped version of the left input images $I_{flipL}$ is necessary because, in reconstructing the left images from the right images using the predicted disparities, there are pixels in the left images that are not present in the right images; hence, no depth values can be predicted for these missing pixels. To overcome this limitation, the horizontally flipped version of the left input images $I_{flipL}$ enables the network to predict the depth values of the occluded pixels, and by using the disparity refinement module, the predicted disparities from both inputs are combined to generate a refined disparity map.

A triple decoder network has a shared encoder and three parallel decoders that can be trained for depth estimation and semantic segmentation. The shared encoder is based on the encoder section of the AsiANet network architecture [17]. The encoded feature vectors are forwarded to the three parallel decoders: two depth decoders and one semantic segmentation decoder. The first depth decoder predicts the left disparity map and is constructed similar to the decoder section of AsiANet [17], while the second depth decoder, which predicts the right disparity map, and the semantic segmentation decoder are based on the ResNet50 decoders described in [13]. However, the last encoder block is modified due to hardware limitations: the number of output channels is reduced from 2048 to 1024. Also, unlike the previous works [13, 17], where a depth decoder generates two disparity maps when using rectified stereo image pairs as training data, each depth decoder in the proposed network generates a single disparity map.

The Siamese triple decoder network generates a pair of refined disparity maps $(D^{final}_L, D^{final}_R)$ at four different scales and a semantic segmentation map $seg_{final}$ at full resolution only, from the left image $I_L$ alone. However, only the full-resolution refined left disparity map $D^{final}_L$ and the semantic segmentation map are used at test time.
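The structure just described can be summarized in a short PyTorch skeleton, shown below as a sketch only: the encoder and decoder arguments are placeholders for the AsiANet-based encoder [17] and the AsiANet/ResNet50-style decoders [13], and all class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class TripleDecoderNet(nn.Module):
    """One branch: a shared encoder feeding two depth decoders and one segmentation decoder."""
    def __init__(self, encoder, depth_dec_left, depth_dec_right, seg_decoder):
        super().__init__()
        self.encoder = encoder
        self.depth_dec_left = depth_dec_left    # predicts a single left disparity map
        self.depth_dec_right = depth_dec_right  # predicts a single right disparity map
        self.seg_decoder = seg_decoder          # predicts per-class segmentation logits

    def forward(self, x):
        feats = self.encoder(x)
        return self.depth_dec_left(feats), self.depth_dec_right(feats), self.seg_decoder(feats)

class JDSNetSketch(nn.Module):
    """Siamese wrapper: the same TripleDecoderNet (shared weights) is applied to the
    left image and to its horizontal flip, and the two sets of outputs are fused."""
    def __init__(self, branch, refine_module, seg_fusion_module):
        super().__init__()
        self.branch = branch                  # one set of parameters, used twice
        self.refine = refine_module           # disparity refinement module (Sec. 3.1.1)
        self.seg_fusion = seg_fusion_module   # segmentation fusion module (Sec. 3.1.2)

    def forward(self, img_left):
        img_flip = torch.flip(img_left, dims=[3])    # horizontally flipped left image
        d_l1, d_r1, seg1 = self.branch(img_left)     # outputs from the original view
        d_l2, d_r2, seg2 = self.branch(img_flip)     # outputs from the flipped view
        d_final_l, d_final_r = self.refine(d_l1, d_r1, d_l2, d_r2)
        seg_final = self.seg_fusion(seg1, seg2)
        return d_final_l, d_final_r, seg_final
```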
3.1.1 Disparity refinement module

The disparity refinement module is based on the post-processing heuristic introduced by Godard et al. [14]. It is incorporated as a trainable component of the proposed Siamese triple decoder network rather than as a refinement step at test time, since a test-time step decouples the refined disparity maps from the training. This design choice enables the network to simultaneously learn depth estimation and refine the predicted depth map in an end-to-end manner.

Essentially, the disparity refinement module performs three operations: a horizontal flip operation, a pixel-wise mean operation, and a disparity ramps removal operation. The horizontal flip operation is performed on the disparity maps $(D_{L2}, D_{R2})$ to generate $(D_{flipL}, D_{flipR})$. Afterwards, the pixel-wise mean operation and the disparity ramps removal operation are performed on $(D_{L1}, D_{flipL})$ and $(D_{R1}, D_{flipR})$, respectively, to produce the refined disparity maps $(D^{final}_L, D^{final}_R)$.

3.1.2 Segmentation fusion module

The segmentation fusion module performs a horizontal flip operation on $seg_2$ to obtain $seg_{flip}$. It then adds the two layers $seg_1$ and $seg_{flip}$ and forwards the result to the softmax layer to output the probabilistic scores for each class and generate a semantic segmentation map $seg_{final}$.
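A minimal sketch of these two modules is given below. The pixel-wise mean and border blending follow the post-processing heuristic of Godard et al. [14], which the paper adapts into a trainable component; the exact ramp weights (here a 5% border fraction) and the function names are assumptions for illustration.

```python
import torch

def refine_disparity(d_direct, d_from_flip):
    """Combine a disparity predicted from the original image (d_direct) with one
    predicted from the horizontally flipped image (d_from_flip), in the spirit of
    Godard et al.'s post-processing heuristic [14]. Both tensors: (B, 1, H, W)."""
    d_flipped_back = torch.flip(d_from_flip, dims=[3])       # horizontal flip operation
    d_mean = 0.5 * (d_direct + d_flipped_back)               # pixel-wise mean operation
    # Disparity ramps removal: near each border, favour the prediction that is
    # not affected by the occlusion ramp (assumed blending scheme).
    b, _, h, w = d_direct.shape
    x = torch.linspace(0, 1, w, device=d_direct.device).view(1, 1, 1, w).expand(b, 1, h, w)
    l_mask = 1.0 - torch.clamp(20.0 * (x - 0.05), 0.0, 1.0)  # weight near the left border
    r_mask = torch.flip(l_mask, dims=[3])                     # weight near the right border
    return r_mask * d_direct + l_mask * d_flipped_back + (1.0 - l_mask - r_mask) * d_mean

def fuse_segmentation(seg1_logits, seg2_logits):
    """Segmentation fusion module: flip the logits predicted from the flipped input,
    add them to the direct logits, and apply softmax over the class dimension."""
    seg_flip = torch.flip(seg2_logits, dims=[3])
    return torch.softmax(seg1_logits + seg_flip, dim=1)
```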
3.2 Loss function

Training the proposed network relies on a loss function that can be expressed as a weighted sum of two losses, a depth loss and a semantic segmentation loss, as defined in equation (1):

$L_{Total} = \lambda_{depth} L_{depth} + \lambda_{seg} L_{seg}$,   (1)

where $L_{depth}$ is the depth loss term, $L_{seg}$ is the semantic segmentation loss term, and $\lambda_{depth}$, $\lambda_{seg}$ are the loss weightings for each term.

3.2.1 Depth loss term

As defined in equation (2), $L_{depth}$ is the sum of the depth losses at four different scales, where $L_s$ is the depth loss at each scale. $L_s$ is a combination of three terms: appearance dissimilarity, disparity smoothness, and left-right consistency. These terms are given by

$L_{depth} = \sum_{s=1}^{4} L_s$,   (2)

$L_s = \lambda_{app} L_{app} + \lambda_{sm} L_{sm} + \lambda_{lr} L_{lr}$,   (3)

$L_{app} = L^{left}_{app} + L^{right}_{app}$,   (4)

$L_{sm} = L^{left}_{sm} + L^{right}_{sm}$,   (5)

$L_{lr} = L^{left}_{lr} + L^{right}_{lr}$,   (6)

where $L_{app}$ is the appearance dissimilarity term, $L_{sm}$ is the edge-aware disparity smoothness term, $L_{lr}$ is the left-right consistency term, and $\lambda_{app}$, $\lambda_{sm}$, $\lambda_{lr}$ are the loss weightings for each term.

The depth loss term takes into account the left and right images, where each component has a left-image term ($L^{left}_{app}$, $L^{left}_{sm}$, $L^{left}_{lr}$) and a right-image term ($L^{right}_{app}$, $L^{right}_{sm}$, $L^{right}_{lr}$). However, this section provides details for the left components $L^{left}$ only, since the right components $L^{right}$ are defined symmetrically.

The appearance dissimilarity term, as defined in (7), is a linear combination of the single-scale structural similarity (SSIM) term [18] and the $L_1$ photometric term. This term measures the quality of the synthesized target image by minimizing the pixel-level dissimilarity between the target image $I$ and the synthesized target image $I^*$. This term is also widely used in previous studies [13, 14, 17] and is given by

$L^{left}_{app} = \frac{1}{N} \sum_{x,y} \left[ \omega \, \frac{1 - SSIM\big(I_L(x,y), I^*_L(x,y)\big)}{2} + (1 - \omega) \, \big\| I_L(x,y) - I^*_L(x,y) \big\| \right]$,   (7)

with a $3 \times 3$ box filter for the SSIM term and $\omega$ set to $0.85$, similar to [13, 14, 17]. The synthesized target left image $I^*_L$ is obtained using a sampler from the spatial transformer network [19] that performs bilinear interpolation. The sampler reconstructs the target left image as $I^*_L$ using the right image $I_R$ and the predicted left disparity map $D_{L1}$.

The edge-aware disparity smoothness term, as defined in (8), regularizes the predicted disparities in spatially similar areas to ensure that the predicted disparities are locally smooth but can be sharp at the edges. This term is given by

$L^{left}_{sm} = \frac{1}{N} \sum_{x,y} \Big[ \big( |\partial_x D_{L2}(x,y)| \, e^{-|\partial_x I_{flipL}(x,y)|} + |\partial_y D_{L2}(x,y)| \, e^{-|\partial_y I_{flipL}(x,y)|} \big) + \big( |\partial_x D^{final}_L(x,y)| \, e^{-|\partial_x I_L(x,y)|} + |\partial_y D^{final}_L(x,y)| \, e^{-|\partial_y I_L(x,y)|} \big) \Big]$,   (8)

where $D^{final}_L$ is the refined left disparity map, $D_{L2}$ is the second predicted left disparity map, and $I_{flipL}$ is the horizontally flipped version of the left image $I_L$.

As described in [13, 14, 17], the left-right consistency term enforces consistency between the left and right disparities, as defined in (9). This term is given by

$L^{left}_{lr} = \frac{1}{N} \sum_{x,y} \big| D^{final}_L(x,y) - D_{R1}\big(x - D_{L1}(x,y), \, y\big) \big|$,   (9)

where $D^{final}_L$ is the refined left disparity map, $D_{R1}$ is the first predicted right disparity map, and $D_{L1}$ is the first predicted left disparity map.

3.2.2 Semantic segmentation loss term

The semantic segmentation loss term, as defined in equation (10), is the standard cross-entropy loss between the predicted pixel-wise semantic labels $seg_{final}$ and the ground-truth pixel-wise semantic labels $seg_{gt}$. The semantic segmentation loss is computed using the left images only, since these images have the corresponding ground-truth semantic labels at full image resolution. This term is given by

$L_{seg} = -\sum_{i=1}^{N} \log P\big(seg^{gt}_i \mid seg^{final}_i\big)$,   (10)

where $seg^{final}_i$ is the pixel-wise prediction for image $I_i$, $seg^{gt}_i$ is the ground-truth semantic labels for image $I_i$, $\log P(y \mid x) = \sum_j \log p(y_j \mid x_j)$, and $p(y_j \mid x_j)$ is the probability of the ground-truth semantic label $y_j$ at pixel $j$.
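For reference, a PyTorch sketch of the left-image loss terms (7)-(9) is shown below; the box-filter SSIM helper, the gradient operators, and the function names are assumptions rather than the authors' implementation. The per-scale loss $L_s$ and the total loss $L_{Total}$ then follow by weighting and summing these terms as in equations (1)-(6).

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 box filter, computed per pixel (assumed form)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, -1, 1)

def appearance_loss(target, synth, omega=0.85):
    """Eq. (7): weighted SSIM + L1 dissimilarity between target and synthesized image."""
    ssim_term = (1.0 - ssim(target, synth)) / 2.0
    l1_term = torch.abs(target - synth)
    return (omega * ssim_term + (1.0 - omega) * l1_term).mean()

def smoothness_loss(disp, img):
    """One half of Eq. (8): edge-aware smoothness of a disparity map w.r.t. its image."""
    dx_d = torch.abs(disp[:, :, :, 1:] - disp[:, :, :, :-1])
    dy_d = torch.abs(disp[:, :, 1:, :] - disp[:, :, :-1, :])
    dx_i = torch.mean(torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1]), dim=1, keepdim=True)
    dy_i = torch.mean(torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :]), dim=1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def left_consistency_loss(d_final_l, d_r1, d_l1, warp_fn):
    """Eq. (9): the refined left disparity should match the right disparity sampled
    at the corresponding location; warp_fn is a disparity-based sampler such as the
    warping sketch shown earlier."""
    d_r1_warped = warp_fn(d_r1, d_l1)  # approximates D_R1(x - D_L1(x, y), y)
    return torch.abs(d_final_l - d_r1_warped).mean()
```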
3.3 Datasets and evaluation metrics

Although the Cityscapes dataset [20] and the KITTI dataset [21] contain a large number of training samples, the proposed semi-supervised learning framework for simultaneous depth estimation, depth refinement, and semantic segmentation requires rectified stereo image pairs with pixel-wise ground-truth semantic labels at training time. Hence, a subset of the Cityscapes dataset, which contains 2,975 finely annotated images, and the KITTI dataset, consisting of 200 images with pixel-wise semantic ground-truth labels, are used in this work. Ramirez et al. [13] introduced a train/test split from the 200 images of the KITTI dataset for joint depth estimation and semantic segmentation. This dataset was split into 160 samples for the train set and 40 samples for the test set. The test set of 40 samples was used to quantitatively evaluate the proposed method over the distance range of 0-80 meters.

The standard evaluation metrics are used to evaluate the trained models quantitatively. The standard evaluation metrics for depth estimation measure average errors, where lower values are better, and accuracy scores, where higher values are preferred [14, 22]. The standard metrics for depth estimation are the absolute relative difference (ARD), the square relative difference (SRD), the linear root mean square error (RMSE-linear), the log root mean square error (RMSE-log), and the percentage of pixels (accuracy score) within thresholds $t$ of $1.25$, $1.25^2$, and $1.25^3$. These metrics are defined in Eq. (11) to Eq. (15), where $d^p_i$ and $d^g_i$ denote the predicted and ground-truth depths at pixel $i$ and $N$ is the number of evaluated pixels:

$ARD = \frac{1}{N} \sum_i \frac{|d^p_i - d^g_i|}{d^g_i}$   (11)

$SRD = \frac{1}{N} \sum_i \frac{\| d^p_i - d^g_i \|^2}{d^g_i}$   (12)

$RMSE_{linear} = \sqrt{\frac{1}{N} \sum_i \| d^p_i - d^g_i \|^2}$   (13)

$RMSE_{log} = \sqrt{\frac{1}{N} \sum_i \| \log(d^p_i) - \log(d^g_i) \|^2}$   (14)
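A compact NumPy sketch of these depth metrics, including the threshold accuracy scores referred to above, could look as follows; the function and its defaults are illustrative rather than the authors' evaluation code.

```python
import numpy as np

def depth_metrics(d_pred, d_gt, min_depth=1e-3, max_depth=80.0):
    """Compute ARD, SRD, RMSE-linear, RMSE-log, and threshold accuracies over valid pixels.

    d_pred, d_gt: arrays of predicted and ground-truth depths in meters.
    The 0-80 m range follows the evaluation setting described in the text.
    """
    valid = (d_gt > min_depth) & (d_gt < max_depth)
    dp, dg = d_pred[valid], d_gt[valid]
    dp = np.clip(dp, min_depth, max_depth)

    ard = np.mean(np.abs(dp - dg) / dg)                          # Eq. (11)
    srd = np.mean(((dp - dg) ** 2) / dg)                         # Eq. (12)
    rmse = np.sqrt(np.mean((dp - dg) ** 2))                      # Eq. (13)
    rmse_log = np.sqrt(np.mean((np.log(dp) - np.log(dg)) ** 2))  # Eq. (14)

    ratio = np.maximum(dp / dg, dg / dp)                         # threshold accuracy scores
    acc = {f"delta<{1.25 ** k:.4g}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}

    return {"ARD": ard, "SRD": srd, "RMSE": rmse, "RMSE-log": rmse_log, **acc}
```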