ERK'2022, Portorož, 368-372

Towards on-the-fly multi-modal sensor calibration

Jon Muhovič 1, Janez Perš 1
1 Univerza v Ljubljani, Fakulteta za elektrotehniko, Tržaška cesta 25, 1000 Ljubljana
jon.muhovic@fe.uni-lj.si

Abstract

The robustness of autonomous vehicles can be significantly improved by using multiple sensor modalities. In addition to standard color cameras and the less frequently used thermal, multispectral and polarization cameras, LIDAR and RADAR are the most commonly used sensors, and they are largely complementary to image sensors. However, the spatial calibration of such a system can be extremely challenging due to the difficulties in obtaining corresponding features from different modalities, as well as the inevitable parallax arising from different sensor positions. In this paper, we present a comprehensive strategy for calibrating such a system using a multi-modal target, and illustrate how this strategy could be upgraded to a fully automatic, target-less calibration that relies on features of the scene itself to correct at least small sensor offsets from the calibrated position. We find that a high-level understanding of the scene is well suited to this task, as it allows us to identify characteristic points for spatially aligning sensor data of different modalities.

1 Introduction

Multi-sensor systems are widely used for their robustness to adverse effects that might hinder one of the sensors, but not the others. Additionally, different modalities can produce richer information that allows easier or higher-quality scene understanding. However, a necessary requirement for such systems is the establishment of the relative inter-sensor positions and rotations, i.e. system calibration. While some sensors lend themselves to this problem quite well, several sensor modalities can prove problematic. Correctly aligning cameras that have different fields of view, are not mounted in a stereo fashion, or differ strongly in their visual modality is an especially difficult task. Our proposed method addresses this by using a custom calibration target and computational approaches that cover a wide range of modalities and ensure that even very different types of sensors can be calibrated.

2 Related work

With the advent of research into self-driving vehicles, a large body of work has been done on calibrating multi-sensor systems. This is mostly LIDAR-camera calibration, but many works also address calibrating thermal and IR sensors with visible-light cameras, mostly for enabling pedestrian detection in low-light circumstances. While stereo camera systems are a tried and true method for calibration [1], many different approaches are still being developed for calibrating more heterogeneous systems. This stems from the fact that different sensors require different approaches, and not all calibration methods are suitable for all combinations of sensors. Many approaches for calibrating a LIDAR sensor with a visible-light camera have been proposed: some of them [2, 3, 4, 5, 6] use planar targets to estimate the mutual position of the sensors, while others use information already present in the observed scene to estimate that position [7, 8], or even couple that approach with rich semantic information [9]. While most of these methods rely on geometric analysis of both sensors' properties, CNNs have also been used to extract the pertinent information from both sensors and estimate their relative positions [10].
Many methods and datasets are being developed for pedestrian detection in low-light scenes. These include datasets such as LLVIP [11], which couples infrared images with visible-light images, and KAIST [12], which contains RGB and thermal images. As discussed in Section 3.1, proper image alignment can only be achieved by using a beam splitter or by employing absolute depth measurement. Several methods have been proposed to align images of different modalities using external depth information [13, 14]. A newer approach to image alignment uses neural networks to extract and align the image features automatically [15]. Wang et al. [16] propose a GAN-based approach for refining weakly aligned image pairs.

A large body of work has recently been presented for semantic segmentation of urban scenes, primarily from an autonomous-vehicle standpoint. For the purpose of autonomous driving, reliable detection of drivable surfaces and potential obstacles is crucial, even in disadvantageous circumstances. On the Cityscapes dataset [17], the state of the art is currently held by the transformer-based method SegFormer [18].

3 Proposed calibration strategy

Our approach is based on a target that can be easily detected in all modalities and is set up so that useful features can be extracted and used for calibration. The key is the use of LIDAR as the focal point of the calibration, due to its 360° field of view and absolute distance measurements. Our calibration procedure relies on detecting a physical target in all of the sensors and extracting features that allow us to establish the relative positioning of the sensors.

3.1 The parallax problem

Calibrating cameras with absolute distance sensors such as LIDAR can prove difficult, but it is ultimately exactly solvable, i.e. once the precise rotation and translation parameters are established, the calibration is correct for any observed point in space. Due to the 2D projection inherent to cameras, alignment between multiple cameras does not work in the same way. In order to perfectly align two images, one of two conditions must be satisfied: either both cameras lie at the exact same point in space, or the observed scene must lie infinitely far away. Otherwise, points at different distances from the cameras map slightly differently, which is observed as the parallax effect. While stereo camera systems use this property to estimate absolute distance, it presents an obstacle in multi-modal camera systems. This can be solved by using external depth measurements to correctly calculate the mapping between images, and this is exactly the advantage of LIDAR.

3.2 Target detection

Localizing the calibration target in visual-spectrum cameras is relatively straightforward, as we employ a standard asymmetric circle grid commonly used for camera calibration. Since the size of the target and the camera's intrinsic parameters are known, we can find the location of each point of the grid in the camera image using a perspective-n-point algorithm. This gives us an approximation of the relative position of the camera with respect to the target. While detecting the calibration grid is important for calculating the intrinsic parameters of cameras, we use it as a way of reliably detecting the entire area of the calibration target. As the size of the target is known, we can extract the locations of its corners with no extra effort. Assuming a pinhole camera model, simply connecting the corners with straight lines gives us the contour of the target in the image plane.
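As an illustration, this camera-side detection can be sketched with OpenCV roughly as follows. This is a minimal sketch: the grid size, circle spacing, board-corner offsets and function names are illustrative assumptions, not parameters taken from our actual setup.

import cv2
import numpy as np

PATTERN_SIZE = (4, 11)   # (points per row, number of rows); assumed grid layout
SPACING = 0.04           # assumed centre-to-centre circle distance in metres

def grid_object_points(pattern_size=PATTERN_SIZE, spacing=SPACING):
    """3D coordinates of the circle centres in the target frame (z = 0 plane),
    laid out for an asymmetric grid. The ordering must match the ordering
    returned by the detector; this should be verified on a test image."""
    cols, rows = pattern_size
    pts = [[(2 * c + r % 2) * spacing, r * spacing, 0.0]
           for r in range(rows) for c in range(cols)]
    return np.array(pts, dtype=np.float32)

def locate_target(gray, K, dist, board_corners_3d):
    """Detect the circle grid, estimate the target pose with PnP and project
    the known physical corners of the board back into the image.
    board_corners_3d: (4, 3) float32 array of corner positions in the target frame."""
    found, centers = cv2.findCirclesGrid(
        gray, PATTERN_SIZE, flags=cv2.CALIB_CB_ASYMMETRIC_GRID)
    if not found:
        return None
    ok, rvec, tvec = cv2.solvePnP(grid_object_points(), centers, K, dist)
    if not ok:
        return None
    corners_2d, _ = cv2.projectPoints(board_corners_3d, rvec, tvec, K, dist)
    return rvec, tvec, corners_2d.reshape(-1, 2)

The returned corner positions give the target contour in the image plane, which is later used to build the edge image needed by the optimization in Section 3.3.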
In order to extract the target position from the LIDAR data, we have to find features of our target that can be detected reliably. Since our LIDAR sensor is arranged in 16 beams and rotates continuously, we can arrange the scan points by beam index as well as by azimuth. The features we wish to detect on our calibration target are its edges. If we observe points along a single LIDAR beam, object edges should be characterized by a large difference in distance between neighboring points. By filtering the LIDAR scan according to this criterion, we can extract potential edges in the scene, which should also include the edges of our calibration target, as shown in Figure 1.

Figure 1: LIDAR point cloud of the calibration target. LIDAR points are depicted in red, while edge points are marked in blue. The target shape is clearly visible.

3.3 Position optimization

The core problem of calibrating LIDAR sensors with cameras is the proper alignment of corresponding features. If the transformation between the sensors' coordinate systems were known exactly, the alignment would be perfect and the features from both sensors would coincide. This serves as the foundation on which the calibration can be optimized. In order to compare the two modalities, edge features are detected in both of them, and the LIDAR points representing edges are then projected onto the image plane. A cost function measuring the alignment between the LIDAR and image edges can then be formulated. In Section 3.2 we described the process of obtaining the image of the target edges. Our cost function should measure how well the projected LIDAR points align with these image edges. The cost function C can be formulated as

C(I, P) = \sum_{i=1}^{N} I(P_i),    (1)

where N is the number of projected LIDAR points, I is the image of the target edges and P_i are the 2D coordinates of the i-th projected LIDAR point. Since the edge image I is smoothed with a Gaussian kernel, its intensities are highest on the target edges, and the resulting intensity gradient should allow convergence.

Using this cost function, standard gradient descent (on the negated cost) can be applied in order to maximize the alignment score. The parameters of the function are the rotation and translation components of the 4×4 transformation matrix used to transform the LIDAR coordinate system into the camera coordinate system. Knowing the camera intrinsic parameters, the LIDAR points can be projected onto the image plane and the cost function can be evaluated. Since the cost function cannot be differentiated analytically, numerical derivative estimation is used instead. A numerical derivative has to be calculated for each of the parameters at each optimization step. We use the central difference formula, which approximates the derivative of our cost function as

f'(x) \approx \frac{f(x + h) - f(x - h)}{2h},    (2)

where h is the step size.

The transformation matrix is a combination of the rotation matrix R and the translation vector T, and as such has 6 free parameters if we encode the rotation using Euler angles. The set of parameters for one LIDAR-camera pair can thus be represented as

\Omega = [x, y, z, \vartheta, \varphi, \psi],    (3)

where ϑ, φ and ψ are the Euler angles describing the rotation about each of the three axes, representing roll, pitch and yaw, respectively.
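To make the optimization step concrete, the following minimal NumPy sketch evaluates the cost of Eq. (1) for a candidate parameter vector Ω and estimates its gradient with the central differences of Eq. (2). The Euler-angle convention, step sizes and variable names are our assumptions for illustration, not values prescribed by the method.

import numpy as np

def euler_to_R(roll, pitch, yaw):
    """Rotation matrix from Euler angles (assumed roll-pitch-yaw about x, y, z)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def cost(params, lidar_edges, edge_img, K):
    """C(I, P): sum of edge-image intensities at the projected LIDAR edge points.
    params = [x, y, z, roll, pitch, yaw]; lidar_edges is an (N, 3) array."""
    t, R = np.asarray(params[:3]), euler_to_R(*params[3:])
    pts_cam = lidar_edges @ R.T + t                    # LIDAR frame -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]               # keep points in front of the camera
    proj = pts_cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    h, w = edge_img.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    uv = uv[ok]
    return float(edge_img[uv[:, 1], uv[:, 0]].sum())

def numerical_gradient(params, lidar_edges, edge_img, K,
                       h=(1e-3, 1e-3, 1e-3, 1e-4, 1e-4, 1e-4)):
    """Central-difference estimate of dC/dΩ for the six pose parameters."""
    grad = np.zeros(6)
    for i in range(6):
        lo, hi = np.array(params, float), np.array(params, float)
        lo[i] -= h[i]
        hi[i] += h[i]
        grad[i] = (cost(hi, lidar_edges, edge_img, K) -
                   cost(lo, lidar_edges, edge_img, K)) / (2 * h[i])
    return grad

# One ascent step on the alignment score:
# params = np.asarray(params, float) + learning_rate * numerical_gradient(params, lidar_edges, edge_img, K)

In the full procedure this gradient is averaged over several target views before each update, as described next.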
Since a solution obtained from a single viewpoint may only be locally correct, using several images with the target in different positions relative to the camera helps to generalize the solution. The approach is similar to using mini-batches in deep learning: the gradient for each image is calculated at each step, and the mean gradient is used to update the solution.

3.4 Camera-camera calibration

Estimating relative camera positions is a difficult task usually reserved for well-constrained cases such as stereo systems, where an offset along only one axis is expected. Relative position estimation for arbitrary camera positions can be done by detecting a known object, calculating the essential matrix and decomposing it. However, this procedure is not very numerically stable, nor is it guaranteed to produce the true calibration parameters even when multiple target positions are used. Instead, we obtain the relative camera positions transitively after the respective LIDAR-camera calibrations have been completed. Given that each of the LIDAR-camera calibrations is estimated using absolute distance measurements projected to the image, the margin for error is much smaller than if target position estimates were used. Let C_1 and C_2 be the coordinate systems of the cameras we wish to align and let L be the LIDAR coordinate system. We use the notation ^A H_B to denote the transformation from coordinate system A to coordinate system B. When calibrating C_1 and C_2 we calculate ^L H_{C_1} and ^L H_{C_2}, respectively. We can then express the relative position of the cameras as

{}^{C_1}H_{C_2} = \left( {}^{L}H_{C_1} \right)^{-1} {}^{L}H_{C_2}.    (4)

The mapping of pixels between images can then be realized by inverse projection as

\begin{bmatrix} \tilde{X} \\ \tilde{Y} \\ \tilde{Z} \end{bmatrix} = K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},    (5)

where K is the camera calibration matrix and (u, v) is the pixel position. The inverse projection produces 3D coordinates X̃, Ỹ and Z̃ which are independent of the camera focal length. If this calculation is performed for all image pixels, it essentially creates a plane at distance Z̃ = 1 in 3D space which, when projected back using K, perfectly reconstructs the original image. If we transform these points (call them Ĩ) to a different coordinate system, we can project them onto the image plane of another camera. However, this mapping will only be correct for objects at distance 1 from the original camera. Since the points Ĩ effectively lie on rays connecting 3D points at unknown depths to the camera center, they can be multiplied by an arbitrary scalar, thus changing the simulated depth of the points with respect to the camera. This allows us to create maps between cameras at arbitrary depths. If we are only interested in observing objects at a specific distance from the cameras, this gives us perfect image alignment.

Figure 2: Depiction of the image alignment process. If the correct depth d of a 3D point P is not known, the pixel mapping can only be approximate, even if the calibration between the cameras is known.

Given this formulation, it can be observed that each point can be moved away from the camera center independently. If some measurement or estimate of absolute depth is available for a pixel, the alignment can be generalized to produce accurate results at any depth. This setup is shown in Figure 2, where the point p_1 on the image plane of camera C_1 is unprojected into 3D space, generating the point P′, which is then projected into camera C_2 with some error that depends on the depth error.
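The sketch below illustrates both steps under the convention that a 4×4 matrix T_cam_lidar maps points expressed in the LIDAR frame into the corresponding camera frame (names and convention are our assumptions for illustration): the camera-to-camera transform is obtained transitively from the two LIDAR-camera calibrations, and a pixel of camera 1 with known absolute depth is remapped into camera 2 via the back-projection of Eq. (5).

import numpy as np

def relative_transform(T_c1_lidar, T_c2_lidar):
    """4x4 transform taking points expressed in the frame of camera 1 into the
    frame of camera 2, obtained transitively via the LIDAR frame."""
    return T_c2_lidar @ np.linalg.inv(T_c1_lidar)

def map_pixel(u, v, depth, K1, K2, T_c2_c1):
    """Map pixel (u, v) of camera 1 with known absolute depth (e.g. taken from a
    projected LIDAR point) onto the image plane of camera 2."""
    ray = np.linalg.inv(K1) @ np.array([u, v, 1.0])    # back-projection (Eq. 5)
    X_c1 = depth * ray                                 # place the point at its true depth
    X_c2 = T_c2_c1[:3, :3] @ X_c1 + T_c2_c1[:3, 3]     # change of coordinate frame
    proj = K2 @ X_c2
    return proj[0] / proj[2], proj[1] / proj[2]

Setting the depth to 1 reproduces the unit-distance plane mapping described above, while supplying a per-pixel absolute depth yields the parallax-corrected mapping.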
If, however, a more precise measurement of the absolute distance d, such as that from a LIDAR sensor, is used, the position of the original point P can be estimated more accurately and its projection p_2 will be positioned correctly. It must be emphasized that this approach requires additional depth information if it is to provide accurate per-pixel alignment of two or more images. However, it is geometrically the only viable option if our goal is to align more than two sensors, where accurate beam-splitter setups might become prohibitively difficult. While CNN-based feature alignment is possible, it is reasonable to exploit accurate geometric approaches as far as possible before resorting to data-driven methods, which can be highly susceptible to noise in the input data.

4 Experiments and results

We tested our approach on multiple modalities, but here we present only the RGB-polarization camera results. After each camera-LIDAR pair was calibrated, the images could be mapped into a joint viewpoint, as described in Section 3.4. We can then verify the correctness of our method by comparing the positions of the detected target points in the first image with the detected points in the second image and observing the reprojection error. Across the 15 images we selected for calibration, the average reprojection error for the RGB-polarization setup is 7.6 px.

Figure 3: Mapping between the ZED RGB camera, the polarization camera and the LIDAR, from left to right: raw RGB image, LIDAR data projected onto the RGB image, image from the polarization camera overlaid onto the RGB image.

4.1 Calibration in the wild

In complex multi-modal sensor systems, geometric decalibration due to slight sensor displacement or rotation while in the field is unfortunately a realistic scenario. In this case, the calibration target is not at hand and cannot be used to correct the decalibration. However, note that our approach relies on detecting and segmenting a single object (the target) in multiple modalities. In the laboratory this can be a target of known dimensions with a standard dot pattern; in the wild it can be any object that can be segmented in multiple modalities. We present only a visual illustration of this approach in Figure 4. Note that the trees on the left side are segmented perfectly in LIDAR space, and slightly worse on the right. Semantic segmentation of the RGB data is not perfect either, but it could be markedly improved using algorithms targeted at the aquatic environment, such as WaSR [19].

Figure 4: Multi-sensor data on the Ljubljanica river. Top: LIDAR data overlaid on the RGB image. Bottom: semantic labels, obtained by SegFormer [18], overlaid on the RGB image.

5 Conclusion

The presented calibration method solves multiple problems of multi-modal, multi-sensor calibration and works well in laboratory conditions. We plan to extend this concept to calibration in outdoor, target-less environments, where features from multiple deep architectures could be used towards the same goal: finding reliable targets (objects) in images and LIDAR point clouds, using LIDAR distance measurements, and proceeding with the proposed approach that works well in the lab.

Acknowledgement

This work was funded by the Slovenian Research Agency (ARRS) under programme [P2-0095] and research project [J2-2506].

References

[1] H. Takahashi and F. Tomita, "Self-calibration of stereo cameras," in 1988 Second International Conference on Computer Vision. IEEE, 1988, pp. 123–128.
[2] E.-S. Kim and S.-Y. Park, "Extrinsic calibration of a camera-lidar multi sensor system using a planar chessboard," in 2019 Eleventh International Conference on Ubiquitous and Future Networks (ICUFN). IEEE, 2019, pp. 89–91.
[3] Z. Pusztai, I. Eichhardt, and L. Hajder, "Accurate calibration of multi-lidar-multi-camera systems," Sensors, vol. 18, no. 7, p. 2139, 2018.
[4] K. Banerjee, D. Notz, J. Windelen, S. Gavarraju, and M. He, "Online camera lidar fusion and object detection on hybrid data for autonomous driving," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1632–1638.
[5] J.-K. Huang and J. W. Grizzle, "Improvements to target-based 3d lidar to camera calibration," IEEE Access, vol. 8, pp. 134101–134110, 2020.
[6] C. Guindel, J. Beltrán, D. Martín, and F. García, "Automatic extrinsic calibration for lidar-stereo vehicle sensor setups," in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–6.
[7] G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, "Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information," in Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[8] M. Á. Muñoz-Bañón, F. A. Candelas, and F. Torres, "Targetless camera-lidar calibration in unstructured environments," IEEE Access, vol. 8, pp. 143692–143705, 2020.
[9] Y. Zhu, C. Li, and Y. Zhang, "Online camera-lidar calibration with sensor semantic information," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4970–4976.
[10] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, "Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1110–1117.
[11] X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, "LLVIP: A visible-infrared paired dataset for low-light vision," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3496–3504.
[12] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon, "Multispectral pedestrian detection: Benchmark dataset and baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1037–1045.
[13] J. Rangel, S. Soldan, and A. Kroll, "3d thermal imaging: Fusion of thermography and depth cameras," in International Conference on Quantitative InfraRed Thermography, vol. 3, 2014.
[14] S. S. Shivakumar, N. Rodrigues, A. Zhou, I. D. Miller, V. Kumar, and C. J. Taylor, "PST900: RGB-thermal calibration, dataset and segmentation network," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 9441–9447.
[15] C. Walters, O. Mendez, M. Johnson, and R. Bowden, "There and back again: Self-supervised multispectral correspondence estimation," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5147–5154.
[16] Y. Wang and D. Wijesekera, "Pixel invisibility: Detecting objects invisible in color images," arXiv preprint arXiv:2006.08383, 2020.
[17] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[18] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[19] B. Bovcon and M. Kristan, "A water-obstacle separation and refinement network for unmanned surface vehicles," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 9470–9476.