ERK'2022, Portorož, 368-372

Towards on-the-fly multi-modal sensor calibration

Jon Muhovič 1, Janez Perš 1
1 Univerza v Ljubljani, Fakulteta za elektrotehniko, Tržaška cesta 25, 1000 Ljubljana
jon.muhovic@fe.uni-lj.si

Abstract

The robustness of autonomous vehicles can be significantly improved by using multiple sensor modalities. In addition to standard color cameras and the less frequently used thermal, multispectral and polarization cameras, LIDAR and RADAR are the most commonly used sensors, and they are largely complementary to image sensors. However, the spatial calibration of such a system can be extremely challenging due to the difficulties in obtaining corresponding features from different modalities, as well as the inevitable parallax arising from different sensor positions. In this paper, we present a comprehensive strategy for calibrating such a system using a multi-modal target, and illustrate how this strategy could be upgraded to a fully automatic, target-less calibration that relies on features of the scene itself to correct at least small sensor offsets from the calibrated position. We find that a high-level understanding of the scene is well suited to this task, as it allows us to identify characteristic points for spatially aligning sensor data of different modalities.

1 Introduction

Multi-sensor systems are widely used for their robustness to adverse effects that might hinder one of the sensors, but not the others. Additionally, different modalities can produce richer information that allows easier or higher-quality scene understanding. However, a necessary requirement for such systems is the establishment of the relative inter-sensor positions and rotations, i.e. system calibration. While some sensors lend themselves to this problem quite well, several sensor modalities can prove problematic. Correctly aligning cameras that have different fields of view, are not mounted in a stereo fashion, or differ strongly in their visual modality is an especially difficult task. Our proposed method addresses this by using a custom calibration target and computational approaches that cover a wide range of modalities and ensure that even very different types of sensors can be calibrated.

2 Related work

With the advent of research into self-driving vehicles, a large body of work has been done on calibrating multi-sensor systems. This is mostly LIDAR-camera calibration, but many works also address calibrating thermal and IR sensors with visible-light cameras, mostly for enabling pedestrian detection in low-light circumstances. While stereo camera systems are a tried and true method for calibration [1], many different approaches are still being developed for calibrating more heterogeneous systems. This stems from the fact that different sensors require different approaches, and not all calibration methods are suitable for all combinations of sensors. Many approaches for calibrating a LIDAR sensor with a visible-light camera have been proposed: some of them [2, 3, 4, 5, 6] use planar targets to estimate the mutual position of the sensors, while others use information already present in the observed scene to estimate that position [7, 8], or even couple that approach with rich semantic information [9]. While most of these methods rely on geometric analysis of both sensors' properties, CNNs have also been used to extract the pertinent information from both sensors and estimate their relative positions [10].
Many methods and datasets are being developed for pedestrian detection in low-light scenes. These include datasets such as LLVIP [11], which couples infrared images with visible-light images, and KAIST [12], which contains RGB and thermal images. As discussed in Section 3.1, proper image alignment can only be achieved by using a beam splitter or by employing absolute depth measurement. Several methods have been proposed to align images of different modalities using external depth information [13, 14]. A newer approach to image alignment uses neural networks to extract and align the image features automatically [15]. Wang et al. [16] propose a GAN-based approach for refining weakly aligned image pairs.

A large body of work has recently been presented for semantic segmentation of urban scenes, primarily from an autonomous-vehicle standpoint. For the purpose of autonomous driving, reliable detection of drivable surfaces and potential obstacles is crucial, even in disadvantageous circumstances. On the Cityscapes dataset [17], the state of the art is currently held by the transformer-based method SegFormer [18].

3 Proposed calibration strategy

Our approach is based on a target that can be easily detected in all modalities and is set up so that useful features can be extracted and used for calibration. The key is the use of LIDAR as the focal point of the calibration, due to its 360° field of view and absolute distance measurements. Our calibration procedure relies on detecting a physical target in all of the sensors and extracting features that allow us to establish the relative positioning of the sensors.

3.1 The parallax problem

Calibrating cameras with absolute distance sensors such as LIDAR can prove difficult, but it is ultimately exactly solvable, i.e. once the precise rotation and translation parameters are established, the calibration is correct for any observed point in space. Due to the 2D projection inherent to cameras, alignment between multiple cameras does not work in the same way. In order to perfectly align two images, one of two conditions must be satisfied: either both cameras lie at the exact same point in space, or the observed scene must lie infinitely far away. Otherwise, points at different distances from the cameras map slightly differently, which is observed as the parallax effect. While stereo camera systems use this property to estimate absolute distance, it presents an obstacle in multi-modal camera systems. This can be solved by using external depth measurements to correctly calculate the mapping between images, and this is exactly the advantage of LIDAR.

3.2 Target detection

Localizing the calibration target in visual-spectrum cameras is relatively straightforward, as we employ a standard asymmetric circle grid commonly used for camera calibration. Since the size of the target and the camera's intrinsic parameters are known, we can find the location of each point of the grid in the camera image using a perspective-n-point algorithm. This gives us an approximation of the relative position of the camera with respect to the target. While detecting the calibration grid is important for calculating the intrinsic parameters of cameras, we use it as a way of reliably detecting the entire area of the calibration target. As the size of the target is known, we can extract the locations of its corners with no extra effort. Assuming a pinhole camera model, simply connecting the corners with straight lines gives us the contour of the target in the image plane.
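As an illustration, this camera-side detection can be sketched with OpenCV roughly as follows. This is a minimal sketch: the grid size, circle spacing, board-corner offsets and function names are illustrative assumptions, not parameters taken from our actual setup.

import cv2
import numpy as np

PATTERN_SIZE = (4, 11)   # (points per row, number of rows); assumed grid layout
SPACING = 0.04           # assumed centre-to-centre circle distance in metres

def grid_object_points(pattern_size=PATTERN_SIZE, spacing=SPACING):
    """3D coordinates of the circle centres in the target frame (z = 0 plane),
    laid out for an asymmetric grid. The ordering must match the ordering
    returned by the detector; this should be verified on a test image."""
    cols, rows = pattern_size
    pts = [[(2 * c + r % 2) * spacing, r * spacing, 0.0]
           for r in range(rows) for c in range(cols)]
    return np.array(pts, dtype=np.float32)

def locate_target(gray, K, dist, board_corners_3d):
    """Detect the circle grid, estimate the target pose with PnP and project
    the known physical corners of the board back into the image.
    board_corners_3d: (4, 3) float32 array of corner positions in the target frame."""
    found, centers = cv2.findCirclesGrid(
        gray, PATTERN_SIZE, flags=cv2.CALIB_CB_ASYMMETRIC_GRID)
    if not found:
        return None
    ok, rvec, tvec = cv2.solvePnP(grid_object_points(), centers, K, dist)
    if not ok:
        return None
    corners_2d, _ = cv2.projectPoints(board_corners_3d, rvec, tvec, K, dist)
    return rvec, tvec, corners_2d.reshape(-1, 2)

The returned corner positions give the target contour in the image plane, which is later used to build the edge image needed by the optimization in Section 3.3.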
In order to extract the target position from the LIDAR data, we have to find features of our target that can be detected reliably. Since our LIDAR sensor is arranged in 16 beams and rotates continuously, we can arrange the scan points by beam index as well as by azimuth. The features we wish to detect on our calibration target are its edges. If we observe points along a single LIDAR beam, object edges should be characterized by a large difference in distance between neighboring points. By filtering the LIDAR scan according to this criterion, we can extract potential edges in the scene, which should also include the edges of our calibration target, as shown in Figure 1.

Figure 1: LIDAR point cloud of the calibration target. LIDAR points are depicted in red, while edge points are marked in blue. The target shape is clearly visible.

3.3 Position optimization

The core problem of calibrating LIDAR sensors with cameras is the proper alignment of corresponding features. If the transformation between the sensors' coordinate systems were known exactly, the alignment would be perfect and the features from both sensors would coincide. This serves as the foundation on which the calibration can be optimized. In order to compare the two modalities, edge features are detected in both of them, and the LIDAR points representing edges are then projected onto the image plane. A cost function measuring the alignment between the LIDAR and image edges can then be formulated. In Section 3.2 we described the process of obtaining the image of the target edges. Our cost function should measure how well the projected LIDAR points align with these image edges. The cost function C can be formulated as

C(I, P) = \sum_{i=1}^{N} I(P_i),    (1)

where N is the number of projected LIDAR points, I is the image of the target edges and P_i are the 2D coordinates of the i-th projected LIDAR point. Since the edge image I is smoothed with a Gaussian kernel, its intensities are highest on the target edges, and the resulting intensity gradient should allow convergence.

Using this cost function, standard gradient descent (on the negated cost) can be applied in order to maximize the alignment score. The parameters of the function are the rotation and translation components of the 4×4 transformation matrix used to transform the LIDAR coordinate system into the camera coordinate system. Knowing the camera intrinsic parameters, the LIDAR points can be projected onto the image plane and the cost function can be evaluated. Since the cost function cannot be differentiated analytically, numerical derivative estimation is used instead. A numerical derivative has to be calculated for each of the parameters at each optimization step. We use the central difference formula, which approximates the derivative of our cost function as

f'(x) \approx \frac{f(x + h) - f(x - h)}{2h},    (2)

where h is the step size.

The transformation matrix is a combination of the rotation matrix R and the translation vector T, and as such has 6 free parameters if we encode the rotation using Euler angles. The set of parameters for one LIDAR-camera pair can thus be represented as

\Omega = [x, y, z, \vartheta, \varphi, \psi],    (3)

where ϑ, φ and ψ are the Euler angles describing the rotation about each of the three axes, representing roll, pitch and yaw, respectively.
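To make the optimization step concrete, the following minimal NumPy sketch evaluates the cost of Eq. (1) for a candidate parameter vector Ω and estimates its gradient with the central differences of Eq. (2). The Euler-angle convention, step sizes and variable names are our assumptions for illustration, not values prescribed by the method.

import numpy as np

def euler_to_R(roll, pitch, yaw):
    """Rotation matrix from Euler angles (assumed roll-pitch-yaw about x, y, z)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def cost(params, lidar_edges, edge_img, K):
    """C(I, P): sum of edge-image intensities at the projected LIDAR edge points.
    params = [x, y, z, roll, pitch, yaw]; lidar_edges is an (N, 3) array."""
    t, R = np.asarray(params[:3]), euler_to_R(*params[3:])
    pts_cam = lidar_edges @ R.T + t                    # LIDAR frame -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]               # keep points in front of the camera
    proj = pts_cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    h, w = edge_img.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    uv = uv[ok]
    return float(edge_img[uv[:, 1], uv[:, 0]].sum())

def numerical_gradient(params, lidar_edges, edge_img, K,
                       h=(1e-3, 1e-3, 1e-3, 1e-4, 1e-4, 1e-4)):
    """Central-difference estimate of dC/dΩ for the six pose parameters."""
    grad = np.zeros(6)
    for i in range(6):
        lo, hi = np.array(params, float), np.array(params, float)
        lo[i] -= h[i]
        hi[i] += h[i]
        grad[i] = (cost(hi, lidar_edges, edge_img, K) -
                   cost(lo, lidar_edges, edge_img, K)) / (2 * h[i])
    return grad

# One ascent step on the alignment score:
# params = np.asarray(params, float) + learning_rate * numerical_gradient(params, lidar_edges, edge_img, K)

In the full procedure this gradient is averaged over several target views before each update, as described next.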
Since a solution obtained from a single viewpoint may only be locally correct, using several images with the target in different positions relative to the camera helps to generalize the solution. The approach is similar to using mini-batches in deep learning: the gradient for each image is calculated at each step, and the mean gradient is used to update the solution.

3.4 Camera-camera calibration

Estimating relative camera positions is a difficult task usually reserved for well-constrained cases such as stereo systems, where an offset along only one axis is expected. Relative position estimation for arbitrary camera positions can be done by detecting a known object, calculating the essential matrix and decomposing it. However, this procedure is not very numerically stable, nor is it guaranteed to produce the true calibration parameters even when multiple target positions are used. Instead, we obtain the relative camera positions transitively after the respective LIDAR-camera calibrations have been completed. Given that each of the LIDAR-camera calibrations is estimated using absolute distance measurements projected to the image, the margin for error is much smaller than if target position estimates were used. Let C_1 and C_2 be the coordinate systems of the cameras we wish to align and let L be the LIDAR coordinate system. We use the notation ^A H_B to denote the transformation from coordinate system A to coordinate system B. When calibrating C_1 and C_2 we calculate ^L H_{C_1} and ^L H_{C_2}, respectively. We can then express the relative position of the cameras as

{}^{C_1}H_{C_2} = \left( {}^{L}H_{C_1} \right)^{-1} {}^{L}H_{C_2}.    (4)

The mapping of pixels between images can then be realized by inverse projection as

\begin{bmatrix} \tilde{X} \\ \tilde{Y} \\ \tilde{Z} \end{bmatrix} = K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},    (5)

where K is the camera calibration matrix and (u, v) is the pixel position. The inverse projection produces 3D coordinates X̃, Ỹ and Z̃ which are independent of the camera focal length. If this calculation is performed for all image pixels, it essentially creates a plane at distance Z̃ = 1 in 3D space which, when projected back using K, perfectly reconstructs the original image. If we transform these points (call them Ĩ) to a different coordinate system, we can project them onto the image plane of another camera. However, this mapping will only be correct for objects at distance 1 from the original camera. Since the points Ĩ effectively lie on rays connecting 3D points at unknown depths to the camera center, they can be multiplied by an arbitrary scalar, thus changing the simulated depth of the points with respect to the camera. This allows us to create maps between cameras at arbitrary depths. If we are only interested in observing objects at a specific distance from the cameras, this gives us perfect image alignment.

Figure 2: Depiction of the image alignment process. If the correct depth d of a 3D point P is not known, the pixel mapping can only be approximate, even if the calibration between the cameras is known.

Given this formulation, it can be observed that each point can be moved away from the camera center independently. If some measurement or estimate of absolute depth is available for a pixel, the alignment can be generalized to produce accurate results at any depth. This setup is shown in Figure 2, where the point p_1 on the image plane of camera C_1 is unprojected into 3D space, generating the point P′, which is then projected into camera C_2 with some error that depends on the depth error.
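The sketch below illustrates both steps under the convention that a 4×4 matrix T_cam_lidar maps points expressed in the LIDAR frame into the corresponding camera frame (names and convention are our assumptions for illustration): the camera-to-camera transform is obtained transitively from the two LIDAR-camera calibrations, and a pixel of camera 1 with known absolute depth is remapped into camera 2 via the back-projection of Eq. (5).

import numpy as np

def relative_transform(T_c1_lidar, T_c2_lidar):
    """4x4 transform taking points expressed in the frame of camera 1 into the
    frame of camera 2, obtained transitively via the LIDAR frame."""
    return T_c2_lidar @ np.linalg.inv(T_c1_lidar)

def map_pixel(u, v, depth, K1, K2, T_c2_c1):
    """Map pixel (u, v) of camera 1 with known absolute depth (e.g. taken from a
    projected LIDAR point) onto the image plane of camera 2."""
    ray = np.linalg.inv(K1) @ np.array([u, v, 1.0])    # back-projection (Eq. 5)
    X_c1 = depth * ray                                 # place the point at its true depth
    X_c2 = T_c2_c1[:3, :3] @ X_c1 + T_c2_c1[:3, 3]     # change of coordinate frame
    proj = K2 @ X_c2
    return proj[0] / proj[2], proj[1] / proj[2]

Setting the depth to 1 reproduces the unit-distance plane mapping described above, while supplying a per-pixel absolute depth yields the parallax-corrected mapping.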
If, however, a more precise measurement of the absolute distance d, such as that from a LIDAR sensor, is used, the position of the original point P can be estimated more accurately and its projection p_2 will be positioned correctly. It must be emphasized that this approach requires additional depth information if it is to provide accurate per-pixel alignment of two or more images. However, it is geometrically the only viable option if our goal is to align more than two sensors, where accurate beam-splitter setups might become prohibitively difficult. While CNN-based feature alignment is possible, it is reasonable to exploit accurate geometric approaches as far as possible before resorting to data-driven methods, which can be highly susceptible to noise in the input data.

4 Experiments and results

We tested our approach on multiple modalities, but here we present only the RGB-polarization camera results. After each camera-LIDAR pair was calibrated, the images could be mapped into a joint viewpoint, as described in Section 3.4. We can then verify the correctness of our method by comparing the positions of the detected target points in the first image with the detected points in the second image and observing the reprojection error. Across the 15 images we selected for calibration, the average reprojection error for the RGB-polarization setup is 7.6 px.

Figure 3: Mapping between the ZED RGB camera, the polarization camera and the LIDAR, from left to right: raw RGB image, LIDAR data projected onto the RGB image, image from the polarization camera overlaid onto the RGB image.

4.1 Calibration in the wild

In complex multi-modal sensor systems, geometric decalibration due to slight sensor displacement or rotation while in the field is unfortunately a realistic scenario. In this case, the calibration target is not at hand and cannot be used to correct the decalibration. However, note that our approach relies on detecting and segmenting a single object (the target) in multiple modalities. In the laboratory this can be a target of known dimensions with a standard dot pattern; in the wild it can be any object that can be segmented in multiple modalities. We present only a visual illustration of this approach in Figure 4. Note that the trees on the left side are segmented perfectly in LIDAR space, and slightly worse on the right. Semantic segmentation of the RGB data is not perfect either, but it could be markedly improved using algorithms targeted at the aquatic environment, such as WaSR [19].

Figure 4: Multi-sensor data on the Ljubljanica river. Top: LIDAR data overlaid on the RGB image. Bottom: semantic labels, obtained by SegFormer [18], overlaid on the RGB image.

5 Conclusion

The presented calibration method solves multiple problems of multi-modal, multi-sensor calibration and works well in laboratory conditions. We plan to extend this concept to calibration in outdoor, target-less environments, where features from multiple deep architectures could be used towards the same goal: finding reliable targets (objects) in images and LIDAR point clouds, using LIDAR distance measurements, and proceeding with the proposed approach that works well in the lab.

Acknowledgement

This work was funded by the Slovenian Research Agency (ARRS) under programme [P2-0095] and research project [J2-2506].

References

[1] H. Takahashi and F. Tomita, "Self-calibration of stereo cameras," in 1988 Second International Conference on Computer Vision. IEEE, 1988, pp. 123–128.
[2] E.-S. Kim and S.-Y. Park, "Extrinsic calibration of a camera-lidar multi sensor system using a planar chessboard," in 2019 Eleventh International Conference on Ubiquitous and Future Networks (ICUFN). IEEE, 2019, pp. 89–91.
[3] Z. Pusztai, I. Eichhardt, and L. Hajder, "Accurate calibration of multi-lidar-multi-camera systems," Sensors, vol. 18, no. 7, p. 2139, 2018.
[4] K. Banerjee, D. Notz, J. Windelen, S. Gavarraju, and M. He, "Online camera lidar fusion and object detection on hybrid data for autonomous driving," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1632–1638.
[5] J.-K. Huang and J. W. Grizzle, "Improvements to target-based 3d lidar to camera calibration," IEEE Access, vol. 8, pp. 134101–134110, 2020.
[6] C. Guindel, J. Beltrán, D. Martín, and F. García, "Automatic extrinsic calibration for lidar-stereo vehicle sensor setups," in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–6.
[7] G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, "Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information," in Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[8] M. Á. Muñoz-Bañón, F. A. Candelas, and F. Torres, "Targetless camera-lidar calibration in unstructured environments," IEEE Access, vol. 8, pp. 143692–143705, 2020.
[9] Y. Zhu, C. Li, and Y. Zhang, "Online camera-lidar calibration with sensor semantic information," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4970–4976.
[10] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, "Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1110–1117.
[11] X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, "LLVIP: A visible-infrared paired dataset for low-light vision," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3496–3504.
[12] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon, "Multispectral pedestrian detection: Benchmark dataset and baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1037–1045.
[13] J. Rangel, S. Soldan, and A. Kroll, "3d thermal imaging: Fusion of thermography and depth cameras," in International Conference on Quantitative InfraRed Thermography, vol. 3, 2014.
[14] S. S. Shivakumar, N. Rodrigues, A. Zhou, I. D. Miller, V. Kumar, and C. J. Taylor, "PST900: RGB-thermal calibration, dataset and segmentation network," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 9441–9447.
[15] C. Walters, O. Mendez, M. Johnson, and R. Bowden, "There and back again: Self-supervised multispectral correspondence estimation," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5147–5154.
[16] Y. Wang and D. Wijesekera, "Pixel invisibility: Detecting objects invisible in color images," arXiv preprint arXiv:2006.08383, 2020.
[17] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[18] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[19] B. Bovcon and M. Kristan, "A water-obstacle separation and refinement network for unmanned surface vehicles," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 9470–9476.