ERK'2022, Portorož, 402-405

Towards fast lighting condition inference for augmented reality

Leon Modic and Luka Čehovin Zajc
University of Ljubljana, Faculty of Computer and Information Science
E-mail: lm4903@student.uni-lj.si, luka.cehovin@fri.uni-lj.si

Abstract

Augmented reality (AR) merges the real world with digital content, a process which requires information about the observed scene. In this paper we address a part of this task by exploring fast inference of a single-light setup in common AR scenarios using deep learning. We propose a new synthetic dataset for training deep models for the task of light condition inference, as well as a dataset of photos used for testing models in a real environment. Using our datasets we have compared existing approaches that vary in the representation of light position information. Additionally, we propose an alternative representation that could extend to encoding multiple lights. We discuss the differences between the evaluated models and provide some ideas for further research in this field.

1 Introduction

Augmented reality aims to blend the real world with digital content, which requires information about the scene, primarily its structure and the camera position. One of the additional visual cues that make the superimposed digital content appear more realistic is matching lighting conditions. While lighting is not necessary for all AR applications (e.g. information overlays), other applications like gaming may benefit from it. In these scenarios lighting may make or break the illusion, as illustrated in Figure 1.

Real-world scene lighting is a complex phenomenon, but the parameters that are inferred for AR are firstly limited by real-time graphics engines, which operate with a number of primitive light sources of different intensities. Secondly, these parameters have to be further simplified due to real-time and low-power-consumption constraints. To this end, most research work in this area focuses on determining the position of a single light source relative to the camera using regression of angle parameters. In this paper we continue the investigation by proposing a new synthetic dataset for training deep models and a real dataset for evaluation. We also show that superior performance can be achieved by treating the problem as a classification problem where individual classes denote discretized angles. We believe that this formulation can be easily extended to multiple light sources.

Figure 1: Comparison of lighting setups. Image (a) shows a generic render of a cube where no information about the light is known. Image (b) shows a render of the same cube with light conditions matching those in the scene. Both examples are rendered using a fast OpenGL renderer.

2 Related Work

Several methods for illumination estimation have been proposed in past years, each with different assumptions and goals. Most early approaches utilize light probes and/or objects with known geometry, e.g. in [1] an object with known geometry and an assumed Lambertian reflectance material is used to predict the positions of multiple light sources. Light probes are used in an AR context in [2]; in [3] a known, globally convex object is also used as a light probe to determine the position and intensity of a light. Combining both ideas, [4] proposes a deep learning model, trained on synthetic versions of a known object that is used as a light probe, to determine lighting conditions in real images.
In [5] the authors propose an interactive method for light estimation without known geometry, but require the user to trace the silhouette of an object. In [6], the authors use a head-mounted camera to classify lighting conditions into a predefined number of illumination classes based on the illumination of hands.

Direct parameter inference using deep models has recently been explored in [7]. Synthetic RGB-D images are used to train prediction of the direction of a single light in an AR scene. Their work has been extended in [8], which works with RGB images and uses stereographic projection to encode the light direction. Our work follows these recent developments. We use synthetic RGB data for training the model and a real dataset for testing. In contrast to [8], our datasets are more diverse and contain more than one object in various spatial configurations. We are also proposing a new way to encode the light position that has the potential to generalize to multiple light sources.

3 Datasets

As mentioned, recent works [7, 8] utilize synthetic datasets for training their models, but they have so far not released their datasets to the research community. The datasets are also quite limited in the diversity of scenes. We have therefore built our own dataset to address these issues. The dataset is generated in Blender using the Cycles ray-tracing engine. Scenes, materials, light and camera positions are generated using a pseudo-random generator, which means that the rendering of a specific sample can be reproduced by controlling the seed value. At the same time, the size of the dataset can easily be extended by generating new samples. For the experiments presented in this paper, we have generated a dataset consisting of more than 60,000 samples using a single light source in combination with ambient light to mimic the diversity of real-world conditions. Several images from the dataset are shown in Figure 2.

Figure 2: Examples from synthetic training (above) and real-world testing (below) datasets.

The second dataset was acquired for testing. The acquisition was done using a Canon EOS 700D camera in a controlled environment with a single dominant light source. The positions of the camera and the light source were determined using fiducial markers. By also changing objects in the scene, we have acquired 100 images, some of which are shown in Figure 2. We have also created an online annotation tool that allows us to manually determine the position of the light using slider controls, with a visualization of a shaded object as feedback. We intend to use this tool to annotate in-the-wild samples. So far we have used it to re-annotate half of the testing dataset in order to evaluate human precision on the same task that is expected from the evaluated models.

Figure 3: Light angle encoding approaches. (a) Direct encoding using radians - RAD, (b) Stereographic projection - STR, (c) Angle discretization - DIS.

4 Methodology

Based on our previous preliminary experiments, we have selected EfficientNet [9] architectures for all our experiments with the proposed dataset. The model receives an RGB image of 128 × 128 pixels as input. For the output we have first implemented two angle regression approaches: regression of the radian angles of the light relative to the camera [7], denoted as RAD, and stereographic projection of the angles [8], denoted as STR. We have relied on the descriptions in the corresponding papers and only changed the backbone part.
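For illustration, a minimal Python sketch of the three output encodings from Figure 3 is given below. It assumes that φ is the azimuth in [0, 360) degrees and θ the elevation in [0, 90] degrees, as defined in the following paragraph. The exact stereographic mapping used in [8] and the bin layout of our discretized (DIS) representation, introduced later in this section, may differ in detail; the function names are illustrative and the default 32 × 16 resolution (assumed to be φ × θ bins) corresponds to model D in Table 1.

import numpy as np

def encode_rad(phi_deg, theta_deg):
    # RAD: direct regression target, both angles expressed in radians.
    return np.array([np.deg2rad(phi_deg), np.deg2rad(theta_deg)], dtype=np.float32)

def encode_str(phi_deg, theta_deg):
    # STR: stereographic projection of the light direction (assumed variant).
    # The unit direction on the upper hemisphere is projected from the south
    # pole onto the equatorial plane, giving a 2D regression target that stays
    # well behaved as theta approaches 90 degrees.
    phi, theta = np.deg2rad(phi_deg), np.deg2rad(theta_deg)
    x = np.cos(theta) * np.cos(phi)   # z points up, theta is the elevation
    y = np.cos(theta) * np.sin(phi)
    z = np.sin(theta)
    return np.array([x / (1.0 + z), y / (1.0 + z)], dtype=np.float32)

def encode_dis(phi_deg, theta_deg, phi_bins=32, theta_bins=16):
    # DIS: one-hot encoding over linearly discretized angle intervals,
    # returning two separable multi-class targets, one per angle.
    phi_idx = min(int(phi_deg / 360.0 * phi_bins), phi_bins - 1)
    theta_idx = min(int(theta_deg / 90.0 * theta_bins), theta_bins - 1)
    phi_target = np.zeros(phi_bins, dtype=np.float32)
    theta_target = np.zeros(theta_bins, dtype=np.float32)
    phi_target[phi_idx] = 1.0
    theta_target[theta_idx] = 1.0
    return phi_target, theta_target

def decode_dis(phi_scores, theta_scores, phi_bins=32, theta_bins=16):
    # Map the most likely bins back to angles at the bin centers (degrees).
    phi = (np.argmax(phi_scores) + 0.5) * 360.0 / phi_bins
    theta = (np.argmax(theta_scores) + 0.5) * 90.0 / theta_bins
    return phi, theta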
Both reference approaches are illustrated in Figure 3 and make certain assumptions regarding the scene that make sense in AR scenarios. The φ angle denotes the azimuth and spans from 0 to 360 degrees, while θ denotes the elevation angle and only spans from 0 to 90 degrees. The stereographic projection of the angles is supposed to disentangle the interaction of both angles when θ approaches 90 degrees and the estimation of φ becomes difficult.

A limitation of the reference approaches [7, 8] is that they only model a single light source. Every extension of the model to multiple lights or the addition of ambient light requires a modification of the architecture. There is also the problem of selecting an appropriate loss function that has to take these interdependent factors into account. We are therefore exploring the option of formulating lighting conditions as a discretized probability distribution. For a single light source the problem can be reduced to a separable multi-class problem where the interval of possible angle values is linearly discretized into a certain number of options. The desired output is then encoded using one-hot encoding. Note that this approach introduces some level of quantization noise. This noise could be mitigated using interpolation, but at the moment we have only evaluated this simple approach.

5 Results

We have evaluated all described approaches in two different contexts: we have used EfficientNet B3 as the basis, once with randomly initialized parameters and once with parameters initialized from a model pre-trained on the ImageNet dataset. All models were trained using the Adam [10] optimization algorithm. The learning rate was set to 0.0002 for the randomly initialized variants and to 0.0001 for the pre-trained variants. The batch size was set to 32 samples. Following [7, 8], we used the MSE loss function for the RAD and STR approaches. For DIS, we have used the standard cross-entropy loss. We have trained all models for 100 epochs.

Table 1: Overview of the experiment results. Each model is represented with a letter from A to E and a number 1 (random initialization) or 2 (ImageNet pre-training). Model hyperparameters are defined by the angle representation (R) and the architecture (A). For the discretized angles approach we have evaluated several configurations: two different discretization resolutions and two different models (EfficientNet B3 and B0). Numbers presented are angle errors on the testing dataset (smaller is better), converted to degrees. The first number is the average error, followed by the variance; the number in parentheses is the median error.

Model  R    A   Resolution  Random (1) φ         Random (1) θ         ImageNet (2) φ       ImageNet (2) θ
A      RAD  B3  -           30.5 ± 32.8 (21.2)   20.5 ± 13.3 (19.1)   24.4 ± 22.4 (16.9)   23.6 ± 11.4 (24.4)
B      STR  B3  -           62.8 ± 48.9 (49.2)   20.1 ± 12.6 (20.9)   51.5 ± 46.5 (28.3)   18.9 ± 12.1 (17.8)
C      DIS  B3  (64, 64)    37.7 ± 41.4 (22.2)   21.5 ± 14.5 (19.3)   23.8 ± 29.3 (14.6)   25.6 ± 11.7 (24.4)
D      DIS  B3  (32, 16)    19.0 ± 21.8 (12.0)   19.6 ± 12.2 (19.0)   20.3 ± 17.3 (15.5)   22.6 ± 10.3 (22.4)
E      DIS  B0  (32, 16)    29.9 ± 32.9 (19.3)   21.1 ± 12.6 (19.9)   21.5 ± 26.4 (14.9)   22.1 ± 10.8 (22.2)

The results of our experiment are presented in Table 1; all errors are given in degrees. The errors for all models look quite high, but to put the numbers into perspective, we have measured the average error of human annotations for a part of the testing dataset, which is 24.9 degrees for φ and 7.3 degrees for θ. It is also clear that the inter-sample variation of the error is quite high. We have therefore also computed the median angle error, which is almost always lower than the average.
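For clarity, a minimal sketch of how these error statistics can be computed is given below. It assumes that the azimuth error is measured as the smallest circular difference (wrapping around at 360 degrees), that the elevation error is a plain absolute difference, and that the reported values are the mean, variance and median as in Table 1; the function names are illustrative.

import numpy as np

def azimuth_error(pred_phi, true_phi):
    # Smallest difference on the 0-360 degree circle (wrap-around assumed).
    diff = np.abs(np.asarray(pred_phi, dtype=np.float64)
                  - np.asarray(true_phi, dtype=np.float64)) % 360.0
    return np.minimum(diff, 360.0 - diff)

def elevation_error(pred_theta, true_theta):
    # Plain absolute difference; the elevation only spans 0-90 degrees.
    return np.abs(np.asarray(pred_theta, dtype=np.float64)
                  - np.asarray(true_theta, dtype=np.float64))

def summarize(errors):
    # Average, variance and median in degrees, following the columns of Table 1.
    errors = np.asarray(errors, dtype=np.float64)
    return errors.mean(), errors.var(), np.median(errors)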
The lower median shows that the error distribution is skewed: we have many samples with low error and a few samples with very high error, as shown in Figure 4, where we look at the distribution of errors for model D1. The errors in φ are indeed not normally distributed. The θ errors are more evenly distributed and are apparently harder to infer.

Figure 4: Distribution of errors on the testing dataset for angles φ and θ according to model D1. The plots show errors for all testing samples with respect to their actual angle. Dotted green lines represent the average error for each distribution.

The reference models (A and B) perform worse than reported in the corresponding papers [7, 8], also due to a more difficult setup with more objects in the scene as well as textures. The best models according to the testing set results are the discretized position models, primarily D1 and D2, despite the quantization errors that occur in this representation. Increasing the discretization resolution (models C1 and C2) does not help, but increases the error, especially with respect to the φ dimension. We attribute this to the simple quantization technique with one-hot encoding; we believe that distributing votes among neighboring cells may reduce this problem. Selected examples of predictions for the D1 model are shown in Figure 5; the first row shows good examples, the second row bad ones. We can see that the model is less sure regarding the θ angle, with more uncertainty in the distribution. We can also see that the model tends to be less reliable when real objects are not distributed over the entire image. Another thing that we have observed is that the model performs worse on the half of the images that contain fiducial markers. All these observations give us new guidelines and ideas for the design of the training dataset.

Finally, we have tested our trained model on several images that we have acquired in-the-wild with more diverse and different light sources. Figure 6 contains several examples where the model behaves well despite weak shading and reflections, leading us to believe that a more general and robust model can be obtained with more work.

Figure 5: Qualitative example overview for model D1. The prediction distribution for φ is presented on the left chart and for θ on the right chart. Examples (a) to (c) show results with good prediction of both angles, examples (d) to (f) show cases with severe problems. The reference object is rendered using an OpenGL-based engine; the shadow intensity is set to a default value and is not predicted by the model.

Figure 6: Preliminary results for the D1 model on samples recorded outside the controlled environment. Only the light position is determined by the model; other light properties are set to default values.

6 Conclusion

In this paper we have presented our preliminary results on fast inference of a single-light setup in common AR scenarios using deep learning. We have proposed a new synthetic dataset for training deep models for the task of light condition inference, as well as a dataset of real images used for testing models in a real environment. Both datasets are meant to become publicly available, thus benefiting the progress of the field. Using these datasets we have evaluated several approaches and proposed a novel representation that could be easily extended to encode multiple lights. Our future work will involve improving and extending the datasets, as well as exploring ways to determine multiple light sources in a scene and working with multiple viewpoints to increase robustness.
Acknowledgements: This research was partially funded by the ARRS research program P2-0214.

References

[1] Yang Wang and Dimitris Samaras. Estimation of multiple directional light sources for synthesis of augmented reality images. Graphical Models, 65(4):185–205, July 2003.

[2] K. Agusanto, Li Li, Zhu Chuangui, and Ng Wan Sing. Photorealistic rendering for augmented reality using environment illumination. In The Second IEEE and ACM International Symposium on Mixed and Augmented Reality, 2003. Proceedings. IEEE Comput. Soc, 2003.

[3] Rang M. H. Nguyen and Minh Ngoc Le. Light source estimation from a single image. In 2012 12th International Conference on Control Automation Robotics & Vision (ICARCV). IEEE, December 2012.

[4] David Mandl, Kwang Moo Yi, Peter Mohr, Peter M. Roth, Pascal Fua, Vincent Lepetit, Dieter Schmalstieg, and Denis Kalkofen. Learning lightprobes for mixed reality illumination. In 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, October 2017.

[5] Jorge Lopez-Moreno, Sunil Hadap, Erik Reinhard, and Diego Gutierrez. Light source detection in photographs. In CEIG, pages 161–167, 2009.

[6] Bruno Augusto Dorta Marques, Rafael Rêgo Drumond, Cristina Nader Vasconcelos, and Esteban Walter Gonzalez Clua. Deep light source estimation for mixed reality. In VISIGRAPP, 2018.

[7] Peter Kán and Hannes Kaufmann. DeepLight: light source estimation for augmented reality using deep learning. The Visual Computer, 35(6-8):873–883, May 2019.

[8] Markus Miller, Alfred Nischwitz, and Rüdiger Westermann. Deep light direction reconstruction from single RGB images. In WSCG 2021 Proceedings, 2021.

[9] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.

[10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.