Proceedings of the 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia February 3-5, 2016 Proceedings of the 21st Computer Vision Winter Workshop February 3-5, 2016, Rimske Toplice, Slovenia © Slovenian Pattern Recognition Society, Ljubljana, February 2016 Volume Editors: Luka Čehovin, Rok Mandeljc, Vitomir Štruc Publisher Slovenian Pattern Recognition Society, Ljubljana 2016 Electronic edition Slovenian Pattern Recognition Society, Ljubljana 2016 © SDRV 2016 CIP - Kataložni zapis o publikaciji Narodna univerzitetna knjižnica, Ljubljana 004.93(082)(086.034.4) 004.8(082)(086.034.4) COMPUTER Vision Winter Workshop (21 ; 2016 ; Rimske Toplice) Proceedings of the 21st Computer Vision Winter Workshop, Rimske Toplice, Slovenia, February 3-5, 2016 [Elektronski vir] / Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.). - Electronic ed. - Ljubljana : Slovenian Pattern Recognition Society, 2016 ISBN 978-961-90901-7-6 1. Čehovin, Luka 283229440 Maloprodajna cena: 19,99 € Message from the program chairs It is our pleasure and privilege to welcome you to the 21st Computer Vision Winter Workshop (CVWW2016). This year the workshop is organized by the Slovenian Pattern Recognition Society (SPRS), and held in Rimske Toplice, Slovenia, from of February 3rd to February 5th, 2016. We hope that your experience at CVWW is both professionally and personally rewarding! The Computer Vision Winter Workshop (CVWW) is an annual international meeting of several computer vision research groups, located in Ljubljana, Prague, Vienna, and Graz. The aim of the workshop is to foster interaction and exchange of ideas among researchers and PhD students. The focus of the workshop spans a wide variety of computer vision and pattern recognition topics, such as image analysis, medical imaging, 3D vision, human-computer interaction, vision for robotics, machine learning, as well as applied computer vision and pattern recognition. CVWW 2016 received a total of 23 submissions from six countries. The paper selection was coordinated by the Program Chairs, and included a rigorous double-blind review process. The international Technical Program Committee consisted of 39 renowned computer vision experts, who conducted the review. Each submission was examined by at least three experts, who were asked to comment on the strengths and weaknesses of the papers and justify their recommendation for accepting or rejecting a submission. The Program Chairs used the reviewers' comments to render the final decision on each paper. As a result of this review process, 8 papers were accepted for oral presentation, and 6 papers were accepted for presentation in the form of a poster. Authors of the accepted posters were also given the opportunity to present their work in the form of short one-minute talks at a designated spotlight session. 8 papers were accepted for presentation at the workshop in the form of invited presentations of on-going work (6 orals and 2 posters), and are not included in the proceedings to avoid conflicts with potential future submissions of the presented material. The Program Chairs would like to thank all reviewers for their high-quality and detailed comments, which served as a valuable source of feedback for all authors, and most of all for their time and effort, which helped to make the CVWW2016 a success. The workshop program included an invited talk by dr. 
Mario Fritz (Laboratory for Autonomous Intelligent Systems, Department of Computer Science, University of Freiburg), to whom we thank for his participation. We also extend our thanks to the Slovenian Pattern Recognition Society, through which the workshop was organized. CVWW 2016 benefits from its sponsors; and we want to acknowledge and thank our supporters from KOLEKTOR and the Faculty of Computer and Information Science for their contributions. To all the sponsors and their representatives in attendance, thank you! We hope that the 21st iteration of the Computer Vision Winter Workshop is a productive and enjoyable meeting for you and your colleagues, and inspires new ideas that can advance your professional activities. Welcome and thank you for your participation! Luka Čehovin, Rok Mandeljc, Vitomir Štruc CVWW2016 Program Chairs Ljubljana, Slovenia, January 2016 Committes PROGRAM CHAIRS Luka Čehovin (FRI University of Ljubljana) Rok Mandeljc (FRI, FE University of Ljubljana) Vitomir Štruc (FE University of Ljubljana) PROGRAM COMMITTEE Csaba Beleznai Stanislav Kovacic Rene Ranftl Horst Bischof Matej Kristan Daniel Prusa Jan Cech Walter Kropatsch Peter Roth Ondrej Chum Vincent Lepetit Robert Sablatnig Ondrej Drbohlav Jiri Matas Radim Sara Boris Flach Martin Matousek Walter Scheirer Vojtech Franc Mirko Navara Alexander Shekhovtsov Friedrich Fraundorfer Tomas Pajdla Danijel Skocaj Margrit Gelautz Peter Peer Tomas Svoboda Michal Havlena Roland Perko Peter Ursic Yll Haxhimusa Janez Pers Tomas Vojir Václav Hlaváč Roman Pfugfelder Andreas Wendel Ines Janusch Thomas Pock Paul Wohlhart Contents 1. Towards a Visual Turing Test: Answering Questions on Images (invited talk) [Abstract] Mario Fritz 2. A Longitudinal Diffeomorphic Atlas-Based Tissue Labeling Framework for Fetal Brains using Geodesic Regression [PDF] Roxane Licandro*, Georg Langs, Gregor Kasprian, Robert Sablatnig, Daniela Prayer, and Ernst Schwartz (Vienna University of Technology) 3. Quantitative Comparison of Feature Matchers Implemented in OpenCV3 [PDF] Zoltan Pusztai (Eörvös Loránd University) and Levente Hajder* (MTA SZTAKI) 4. Real-Time Eye Blink Detection using Facial Landmarks [PDF] Tereza Soukupova* and Jan Cech (Czech Technical University in Prague) 5. Solving Dense Image Matching in Real-Time using Discrete-Continuous Optimization [PDF] Alexander Shekhovtsov*, Christian Reinbacher, Gottfried Graber, and Thomas Pock (Graz University of Technology) 6. Touching without vision: terrain perception in sensory deprived environments [PDF] Vojtěch Šalanský*, Vladimír Kubelka, Karel Zimmermann, Michal Reinštein, and Tomas Svoboda (Czech Technical University in Prague) 7. Hessian Interest Points on GPU [PDF] Jaroslav Sloup, Jiri Matas, Michal Perdoch, Stepan Obdrzalek* (Czech Technical University in Prague) 8. BaCoN: Building a Classifier from only N Samples [PDF] Georg Waltner*, Michael Opitz, Horst Bischof (Graz University of Technology) 9. Cuneiform Detection in Vectorized Raster Images [PDF] Judith Massa, Bartosz Bogacz*, Susanne Krömker, Hubert Mara (University Heidelberg) 10. 2D tracking of Platynereis dumerilii worms during spawning [PDF] Daniel Pucher*, Walter Kropatsch, Nicole Artner, Stephanie Bannister (Vienna University of Technology) 11. Significance of Colors in Texture Datasets [PDF] Milan Šulc*, Jiri Matas, (Czech Technical University in Prague) 12. A Novel Concept for Smart Camera Image Stitching [PDF] Hanna Huber, Majid Banaeyan*, Raphael Barth, Walter Kropatsch (Vienna University of Technology) 13. 
A concept for shape representation with linked local coordinate systems [PDF] Manuela Kaindl*, Walter Kropatsch (Vienna University of Technology) 14. A Computer Vision System for Chess Game Tracking [PDF] Can Koray*, Emre Sumer (Başkent University) 15. Fast L1-based RANSAC for homography estimation [PDF] Jonáš Šerých*, Jiri Matas, Ondrej Drbohlav (Czech Technical University in Prague) Invited talk Towards a Visual Turing Test: Answering Questions on Images Mario Fritz Max Planck Institute for Informatics and Saarland University Abstract We address the task of automatically answering questions on images by bringing together latest advances from natural language processing and computer vision. In order to quantify progress on this challenging problem, we have established the first benchmark for this challenging problem that can be seen as a modern attempt at a visual turing test. Our first approach to this problem follows a more traditional AI approach, where we combine discrete reasoning with uncertain predictions by a multi-world approach that models uncertainty about the perceived world in a bayesian framework. More recently, we build on the success of deep learning techniques and propose an end-to-end formulation of this problem for which all parts are trained jointly. Looking forward, we see these two approach as two ends of a spectrum ranging from symbolic representations to vector-based embedding that we are currently exploring. Sponsors 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A Longitudinal Diffeomorphic Atlas-Based Tissue Labeling Framework for Fetal Brains using Geodesic Regression Roxane Licandro1,2 licandro@caa.tuwien.ac.at 1 Institute of Computer Aided Automation, Computer Vision Lab, Vienna University of Technology, http://www.caa.tuwien.ac.at/cvl 2 Department of Radiology and Image-guided Therapy, Computational Imaging Research Lab, Medical University of Vienna, http://www.cir.meduniwien.ac.at Georg Langs2 Gregor Kasprian2 Robert Sablatnig1 Daniela Prayer2 Ernst Schwartz2 Abstract. The human brain undergoes structural tal brain during pregnancy, a single map is not suf- changes in size and in morphology between the sec- ficient to model brain development [19]. Changes ond and the third trimester of pregnancy, corre- in size, according to accelerated growth, changes in sponding to accelerated growth and the progress of morphology, due to the progress of cortical folding cortical folding. To make fetal brains comparable, and deceleration of the proliferation of ventricular spatio-temporal atlases are used as a standard space progenitor cells [16] occur and are illustrated in Fig- for studying brain development, fetal pathology loca- ure 1a. Thus, a collection of brain maps is needed tions, fetal abnormalities or anatomy. The aim of this to describe these alterations as a function of time. work is to provide a continuous model of brain devel- For studying the brain organisation during its de- opment and to use it as base for an automatic tissue velopment, abnormalities, and locations of patholo- labeling framework. This paper provides a novel lon- gies, brain maps are used as a reference model [18]. gitudinal fetal brain atlas construction concept for Newly acquired brain images are labelled to iden- geodesic image regression using three different age- tify structures and possible abnormal changes or to ranges which are parametrized according to the de- find indicators for diseases. 
This labeling can be per- velopmental stage of the fetus. The dataset used for formed manually by annotating the images, which evaluation contains 45 T2−weighted Magnetic Res- needs an expert, time and consequently leads to in- onance (MR) volumes between Gestation Week (GW) creased costs compared to an automatic labeling pro- 18.0 and GW 30 day 2. The automatic tissue label- cedure [3]. In this case, labels for non annotated ing framework estimates cortical segmentations with images are estimated automatically by software us- a Dice Coefficient (DC) of up to 0.85 and ventricle ing a brain model for the mapping. Such an auto- segmentations with a DC of up to 0.60. mated labeling procedure on the one hand and a ref- erence model on the other form an atlas. To cover the time-dependent development of the fetal brain, time- 1. Introduction varying reference models are considered for building spatio-temporal atlases. The aim of brain mapping experiments is to cre- ate maps (models), based on studies, to understand 1.1. State-of-the-Art structural and functional brain organization. To this end, neuroimaging methods as well as knowledge of State-of-the-art approaches [8, 10, 13, 17, 21] for neuroanatomy and physiology are combined. Due to computing a spatio-temporal atlas combine registra- the fundamental changes occurring in the human fe- tion methods and interpolation techniques to obtain (a) Fetal Brain Development (b) Observable Brain Structures Figure 1: Left: MR imaging and schematic illustration of the fetal brain development at GW 20, 23 day 3 and 30 day 2. Right: Illustration of identifiable brain structures in a T2 weighted fast MR image acquired with a 1.5 Tesla scanner (Grey Matter (GM), White Matter (WM), the VENTricles (VENT) and the Germinal MATrix (GMAT) [21]). Also extraventricular Cerebro Spinal Fluid (CSF), Deep Grey Matter (DGM) and Non-Brain structures (NB), like skull or amniotic fluid are identifiable. MR images courtesy of Medical University of Vienna (MUW). continuity in time. The use of an ”all-to-one” ap- imaging technique is used as an alternative to ultra- proach (a single subject as reference) introduces sub- sonography for prenatal diagnosis and is able to im- stantial bias. The brain structures of fetuses can- age a fetus in a non-invasive way. Distinguishable not be described by one image, since it does not re- brain structures using this technique are illustrated in flect occurring changes over time [10, 17]. Exclu- Figure 1b. A problem of MR imaging is the lack of sive pairwise affine registration for image alignment comparability and constancy of gray-values. Thus, results in blurred regions in the templates obtained for the comparison of brains of adult patients, an at- by intensity averaging. Affine registration is not ca- las as a standard space is required, which avoids the pable of compensating local inter-subject variabil- gray-value discrepancies. The brains are mapped to a ity [17]. This leads to worse registration results be- standardized coordinate system according to marked tween atlas-based segmentations and individual ob- anatomical locations. However, the fetal brain is a jects compared to non-rigid approaches, which show developing structure. In comparison to building an a higher level of detail [17]. An advantage of pair- atlas of an adult brain, the fast change of a fetal brain wise approaches lies in the registration of wider age- in shape and size has to be taken into account [10]. 
ranges between 15 to 18 Gestation Weeks (GW), Also, fetal brains at a certain GW show differences in compared to groupwise approaches, which are able orientation shape and size. Possible reasons are the to cover only small age ranges between 5 to 8 GW. inaccuracy in determination of the gestational age, A benefit of groupwise registration approaches is inter-patient variability or pathological growth pro- the template-free estimation of the initial reference cesses [15]. The motivation for building a fetal atlas space. The template is estimated and updated dur- is the possibility to compare fetal brains for study- ing the registration procedure [10]. The main limita- ing brain development, fetal pathology locations, fe- tions of groupwise registration lie in the lower level tal abnormalities or anatomy. of anatomic definition [17]. Examples for pairwise approaches can be found in [10, 17] and for group- 1.3. Contribution wise approaches in [8, 13, 21]. We create a tissue labeling framework for corti- cal and ventricle structures in the fetal brain from 1.2. Challenges GW 18 to GW 30. An automatic segmentation pro- Imaging of a fetus in utero is challenging, due to cedure including a longitudinal fetal brain atlas and its constantly changing position, which causes image a labeling procedure are considered. In our work unsharpness and artefacts [5]. Thus, a main issue in we demonstrate that image regression is capable to fetal imaging lies in shortening the image acquisi- build a spatio-temporal atlas of the fetal brain and tion time to 20 seconds and to use motion correc- is able to model a mean trajectory encoding the tion techniques [4]. The Magnetic Resonance (MR) brain development in a single diffeomorphic defor- mation, instead of calculating discrete age-dependent tion = 0.78 - 0.9 pixels per mm, Slice thickness = templates combined with interpolation. As found 3 - 4.4mm, Acquisition matrix = 256×256, Field in literature [7, 9, 11], image regression for time- of view = 200 - 230mm, Specific Absorption Rate series data have been evaluated only using adult- and (SAR) = < 100% /4.0W/kg, Image acquisition time child-brain datasets, which record changes of brain = ≤ 20s, TE (Echo Time) = 100 - 140ms, TR (Rep- structure over time. In the proposed work the lo- etition Time) = 9000 - 19000ms. The dataset of MR cal inter-subject variability is considered to be mod- images used for atlas learning are preprocessed using elled continuously in time and non-rigidly in space the pipeline illustrated in Figure 2. First the images by geodesic regression [1, 2]. The computed atlas are motion corrected using the toolkit for fetal brain is used as a prior of the Graph Cut (GC) approach MR images published by Rousseau et al. [14]. Sub- for multi label segmentation proposed by Yuan et sequently, the manual annotation of the cortex, left al. [20]. and right eye, ventricle and occipital foramen mag- The paper is organized as follows. In Section 2 an num is performed by an expert. After this step, the overview of the methodology used and the concept data is rigidly aligned, the surrounding mother tissue of the tissue labeling framework proposed is pre- is excluded in a masking step and the volumes are sented. The results and the corresponding discussion cropped to reduce computational costs in the longi- are given in Section 3. This work concludes with a tudinal registration procedure using a bounding box summary of the contributions in Section 4. of size 90 × 140 × 140 voxels. 2.2. 
Spatio Temporal Atlas Learning 2. Methodology The algorithm used for Diffeomorphic Anatom- The framework proposed is illustrated in Figure 2. ical RegistraTion using Exponential Lie algebra The input represents a gray value image I_new at time (DARTEL) of Ashburner et al. [1, 2] for geodesic point t_new, which is preprocessed in a first step, by regression is integrated in the Statistical ParaMetric performing motion correction, rigid alignment, im- (SPM) tool box - release SPM8 1. This approach age masking and image cropping. Subsequently, the is used to encode the brain development in a single longitudinal diffeomorphic fetal brain atlas is used to diffeomorphic deformation by optimising the energy estimate a time point t_new corresponding diffeomor- term E expressed in Equation 1 [2]. phic transformation for computing a time-dependent intensity image I_A and a time-dependent segmenta-
E = \frac{1}{2}\|L v_0\|^2 + \frac{1}{2}\sum_{n=1}^{N}\int_{x\in\Omega}\|I_{t_0} - I_{t_n}(\varphi_{t_n})\|^2\,dx \quad (1)
tion for ventricular and cortical tissue S^tissue_A in atlas space. In a pairwise registration procedure, a trans- formation T from the preprocessed input (Aligned The term ϕ_{t_n} denotes the forward deformation from I_new) to the atlas-based intensity image I_A is esti- source I_{t_0} to target I_{t_n} at time point t_n, mated. The inverse of the computed transformation where n = 1, . . . , N and L represents a model of the ”inertia” T^{-1} is used to transform the atlas based segmenta- of the system, i.e. a linear operator which operates tions S^tissue_A to the subject’s space (S^tissue_A ◦ T^{-1} = on a time-dependent velocity that mediates the defor- S^tissue_GC). As next step the transformed segmentations mation over unit time [2]. It is introduced to derive S^tissue_GC and I^GC_new are used as input parameters for the an initial momentum m_0 multi label GC segmentation refinement. The output through an initial velocity v_0. of the framework are segmentations for ventricular The velocity field v(x) learned at position x is parametrised using a linear combination of i basis and cortical brain tissues S^tissue_new of the input image functions. Such basis functions consist of a vector I_new. of coefficients c_i and an i-th first degree B-spline basis 2.1. Image Acquisition and Preprocessing function ρ_i(x) (cf. Equation 2) [1]. The time series MR image dataset used consists
v(x) = \sum_i c_i\,\rho_i(x) \quad (2)
of 45 healthy fetal brains with an age range between 18 and 30 GW. The MR image acquisition is per- formed using a 1.5 Tesla Philips Gyroscan superconduct- The aim of the DARTEL implementation is to esti- ing unit scanner performing a single-shot, fast spin- mate an optimized parametrisation of c. The energy 1http://www.fil.ion.ucl.ac.uk/spm/; echo T2-weighted MR sequence: In-plane resolu- [accessed 07 December 2015] Figure 2: Fetal brain tissue labeling framework. MR images courtesy of MUW. cost term E in Equation 1 is reformulated in terms term µ encodes the variance according to symmetric of finding the coefficients of c for a given dataset D components, rotations and the penalisation of scaling with maximum probability (cf. Equation 3). A maxi- and shearing. The likelihood term encodes the prob- mization of the probability leads to the minimization ability of c given the data D [1] and corresponds to of its negative logarithm and thus, is used to interpret the mean-squared difference between a warped tem- registration of data D as a minimization procedure plate deformed by the calculated transformation and of the objective function (− log p(c, D)) expressed the target image.
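To make Equations 1 and 2 concrete, the following rough NumPy sketch evaluates the two quantities. It is an illustration only, not the SPM8/DARTEL code; the operator L and the warping routine are passed in as placeholders:

```python
import numpy as np

def velocity_field(coeffs, basis_functions, x):
    """Equation 2: v(x) = sum_i c_i * rho_i(x), with first-degree B-spline bases."""
    return sum(c * rho(x) for c, rho in zip(coeffs, basis_functions))

def regression_energy(v0, L_operator, source, targets, warp_with_velocity):
    """Equation 1 (sketch): E = 1/2 ||L v0||^2 + 1/2 sum_n ||I_t0 - I_tn(phi_tn)||^2.

    v0                 : initial velocity field (NumPy array).
    L_operator         : callable applying the linear 'inertia' operator L.
    source             : template image I_t0.
    targets            : list of (t_n, image) pairs for the observed time points.
    warp_with_velocity : hypothetical callable that applies the forward
                         deformation phi_tn (derived from v0) to a target image.
    """
    regularisation = 0.5 * np.sum(L_operator(v0) ** 2)
    data_term = 0.0
    for t_n, target in targets:
        warped_target = warp_with_velocity(target, v0, t_n)  # I_tn(phi_tn)
        data_term += 0.5 * np.sum((source - warped_target) ** 2)
    return regularisation + data_term
```

Minimising E over the B-spline coefficients c (Equation 3 below recasts this as a probabilistic objective) yields the single deformation that encodes the whole developmental trajectory.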
in Equation 3, consisting of a prior term (− log p(c)) and a likelihood term (− log p(D|c)) [1]. 2.2.1 Optimisation Procedure − log p(c, D) = − log p(c) − log p(D|c) (3) A Full Multi Grid (FMG) approach is used to solve the equation (cf. Equation 4) which is needed to up- The prior term denotes the prior probability p(c). date the vector field during a Gauß-Newton opti- Ashburner et al. [1] use a concentration matrix (in- mising procedure, where Hiter denotes the Hessian, verse of a covariance matrix) K to encode spa- giter the gradient and K the concentration matrix. tial variability. The parameters [λ Details regarding the computation of viter+1 are ex- 1, λ2, λ0, λ, µ], 0 plained in [1, 2]. which have to be predefined to compute K, influence the behaviour of the deformation (bending energy, viter+1 = viter − (K + Hiter)−1 (Kviter + giter) (4) 0 0 0 stretching, shearing) as well as the divergence and amount of volumetric expansion or contraction [1]. For this task images are observed in different scales. The term λ0 encodes the penalisation of absolute dis- For every resolution level multigrid methods recur- placements, λ1 penalises the difference between two sively estimate the field, starting at the coarsest scale neighboured vectors by observing the first derivatives and computing the residual to solve the update equa- (linear term) of the displacements, λ2 penalises the tions on the current grid. Subsequently, the solution difference between the first derivatives of two neigh- is prolongated to the next finer grid [1]. boured vectors by observing the second derivatives 2.3. Automatic Tissue Labeling using Graph Cuts of the displacements and λ denotes the variability of the spatial locations (divergence of each point in the For tissue labeling, we use a continuous max flow flow field) with a constant value. Increasing λ leads formulation of a multi label GC [20]. Three input to increasing smoothing of the flow vector field and parameters are necessary for performing tissue seg- preserves volumes during the transformation. The mentation. A data term (gray value volume Inew at age tnew), a cost (unary) term, and a penalty day 3 (164 GD) to 26 GW day 2 (184 GD) and age (binary) term. For computing a unary term, atlas range 3 from 26 GW day 2 (184 GD) to 30 GW day 2 based segmentations for cortex and ventricle tissue (212 GD). The first part of the evaluation documents Stissue = {Scortex, Sventricle} at age t are estimated the atlas learning results for each age range. Subse- and smoothed with a Gaussian filter G. The parame- quently, the atlases computed are used to evaluate the ter δ is defined to weight the smoothed result with a tissue labeling procedure as a second part of the eval- constant factor. The unary term is illustrated in Equa- uation. Estimated atlas templates at the testing time- tion 5, where ? denotes the convolution operator. point are pairwise registered to the test MR volume to obtain a transformation T . The inverse T −1 is used C = δ ∗ (Stissue ? G) (5) to transform the atlas based segmentation to the test- subject’s space. As last step the segmentation of the Three different binary terms are evaluated: test volume using the transformed atlas is computed. Penalty term 1 (P1) is a weighted norm of the gra- To evaluate our approach, we report the overlap be- dient of the data term D (cf. Equation 6), where δ tween automatic- and manual segmentations of the denotes the same weighting term as used in Equation fetal cortex and ventricles. 
In the leave-one-out cross 5 and a, b are constant weighting parameters. validation, we compare the Dice Coefficient (DC) [6] between the groundtruth annotation and different au-
P_1 = \delta\,\frac{b}{1 + a\,\|\nabla D\|} \quad (6)
tomatic segmentations based on (1) the atlas, (2) the transformed atlas, and (3) the GC segmentation opti- Penalty term 2 (P2) denotes an intensity based term mization. and is calculated separately for cortex and ventri- Furthermore, we report the volume of cortex and cle segmentation (cf. Equation 7). Tissue type spe- ventricles, and the area of the cortical surface of the cific gray values are modelled as Gaussian distribu- atlas based segmentations. tions N(µ_tissue, σ_tissue), whose parameters µ_tissue and σ_tissue are estimated using the a-priori atlas seg- 3.1. Results Spatio-Temporal Atlas Learning mentation. These parameters are used to calculate the probability of every pixel belonging to cortex or The deformation behaviour of image regres- ventricle. Subsequently, the gradient of the resulting sion using 21 different regularisation kernels probability map P and its norm are computed and K [λ_1, λ_2, λ_0, λ, µ] (cf. Section 2.2) is evaluated for weighted by the parameters δ, a, b as shown in Equa- every age range. Besides the DC also the behaviour tion 6. of the regularisation of the volume expansion and changes of the area of cortical surface have to be
P_2 = \delta\,\frac{b}{1 + a\,\|\nabla P(\mu_{tissue}, \sigma_{tissue})\|} \quad (7)
taken into account, when choosing a suitable ker- nel. Atlas-based cortical and ventricle segmentations Penalty term 3 (P3) represents an exponential for- are studied. According to the evaluation results, ker- mulation and is expressed in Equation 8. The param- nel 1 (K1 0.01, 0.01, 9e−6, 1e−5, 1e−5) is chosen eter u is a constant and v a linear weighting parame- as suitable regularisation for age range 1, kernel 4 ter. The term w weights the norm of the image’s D (K4 0.01, 9e−6, 9e−6, 0.01, 1e−5) for age range 2 gradient non-linearly in the exponential term. and kernel 7 (K7 0.01, 0.01, 9e−6, 0.01, 1e−5) for age range 3. Figure 3a shows examples of the at- las templates learned and Figure 3b illustrates the
P_3 = u + v\,\exp\!\left(-\frac{\|\nabla D\|}{w}\right) \quad (8)
anatomical details of these at age GW 21 day 4 (GD 151), GW 24 day 3 (GD 171) and GW 29 (GD 203). 3. Results In both figures the growth of the brain structures is Evaluation of the proposed framework is per- observable. The brain model at age range 1 is char- formed using leave-one-out cross validation. In this acterised by a smoother cortex surface in compari- paper a novel longitudinal registration procedure is son to a brain at a higher age range. It also visu- formulated by dividing the data set into three age alises the increase of the cortical folding grade. Ac- ranges, based on the developmental stage of the fetus. cording to Pugash et al. [12], the ventricles achieve Age range 1 reaches from 20 GW day 6 (146 GD) to their thickest size in early gestation and regress in the 23 GW day 3 (164 GD), age range 2 from 23 GW third trimester, which is not visible. The regularisa-
[Figure 3 panels: atlas based templates for age range 1 (Kernel 1, GD 148–163), age range 2 (Kernel 4, GD 164–181) and age range 3 (Kernel 7, GD 184–212); (a) Atlas based templates, (b) Details of atlas based templates.]
Figure 3: Left: Atlas based templates of age range 1, 2 and 3 between GW 21 day 1 (GD 148) and GW 30 day 2 (GD 212).
Right: Anatomical details of atlas based templates at age GW 21 day 4 (GD 151), GW 24 day 3 (GD 171) and GW 29 (GD 203). Coronal (first row), axial (second row) and sagital (third row) slices are illustrated. Denoted structures: Sylvian Fissure (SF), InterHemispheric Fissure (IHF), Germinal MATrix (GMAT), Lateral-VENTricle (L-VENT), Cingulate Sulcus (CiS), ColLateral Sulcus (CLS), Cavum of Septum Pellucidum (CSP), Occipital Lobe (OL), Frontal Lobe (FL), Central Sulcus (CeS), PreCentral Gyrus (PreCG), PostCentral Gyrus (PostCG), ParietoOccipital Sulcus (POS) and Calcarine Sulcus (CaS). tion term for geodesic regression is not able to model (CaS) and PreOccipital Sulcus (POS). location specific volume expansion and shrinkage at 3.2. Results Automatic Tissue Labeling the same time. This leads to worse modelling results for ventricles, compared to cortical structure, since a For pairwise registration kernel A kernel is chosen which models expansion. Addition- (K A 5e−3, 5e−3, 3e−5, 1e−5, 9e−6) is used ally, the subject specific variability of age-dependent for regularisation. The DC distributions of seg- ventricle size in the dataset and the complex form of mentations of the cortex for age range 1, 2 and ventricles complicate the determination of a suitable 3 are illustrated in Figure 4 on the top and for kernel and consequently the registration procedure. ventricle segmentations on the bottom. The DC Observable structures at every age range are Sylvian distribution of atlas based and transformed atlas- Fissures (SF), Lateral VENTricle (L-VENT), Inter- based segmentations using pairwise registration are Hemispheric Fissure (IHF), Cavum of Septum Pellu- illustrated and the three dotted lines visualise the cidum (CSP), Occipital Lobe (OL) and Frontal Lobe DCs of GC based segmentations computed using (FL). The SF show in the coronal and axial slices a penalty terms 1, 2 and 3. For age range 1 the smooth bending at age range 1 and develop to a deep highest DC improvement from 0.727 to 0.771 at fold at the lateral side of the brain at age range 3. GD 158 is achieved by pairwise registration and GC Also the IHF shows a deeper folding at age range refinement compared to atlas based segmentations. 3 with Cingulate Sulcus (CiS) as additional form- In contrast to this no improvement is reached at GD ing compared to age range 1. The Germinal MATrix 151, but shows the highest DC of about 0.851. At (GMAT) is existent until age range 2 and disappears GDs older than 154 the GC refining using penalty later in the third trimester of pregnancy. The Central 1 and penalty 2 achieve a higher DC increase of Sulcus (CeS) formation starts at age range 2 and gets about 0.02 compared to using penalty 3. At age more apparent at age range 3 as well as the develop- range 2 no improvement of transformed atlas based ing of the PreCentral Gyrus (PreCG) and PostCen- segmentations is observed after pairwise registration, tral Gyrus (PostCG). The ColLateral Sulcus (CLS) is which leads to a decrease of the DC. 
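For readers who want to experiment with the graph-cut terms used in these results, Equations 5–8 translate almost directly into NumPy/SciPy. The sketch below is a plain re-statement of those formulas (δ, a, b, u, v, w are the weighting constants of Section 2.3, chosen by the user), not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_magnitude(volume):
    # Norm of the spatial gradient, used by all penalty terms.
    grads = np.gradient(volume.astype(np.float64))
    return np.sqrt(sum(g ** 2 for g in grads))

def unary_term(atlas_segmentation, delta, sigma=1.0):
    # Equation 5: C = delta * (S_tissue convolved with a Gaussian G).
    return delta * gaussian_filter(atlas_segmentation.astype(np.float64), sigma)

def penalty_1(intensities, delta, a, b):
    # Equation 6: penalty driven by the gradient of the data term D.
    return delta * b / (1.0 + a * gradient_magnitude(intensities))

def penalty_2(intensities, atlas_segmentation, delta, a, b):
    # Equation 7: same form, but on a tissue probability map built from a
    # Gaussian intensity model whose mean/std are estimated from the atlas prior
    # (assumes the prior marks at least a few voxels).
    fg = intensities[atlas_segmentation > 0]
    mu, sigma = fg.mean(), fg.std() + 1e-6
    prob = np.exp(-0.5 * ((intensities - mu) / sigma) ** 2)
    return delta * b / (1.0 + a * gradient_magnitude(prob))

def penalty_3(intensities, u, v, w):
    # Equation 8: exponential weighting of the image gradient norm.
    return u + v * np.exp(-gradient_magnitude(intensities) / w)
```

These maps would then be handed to the continuous max-flow solver of Yuan et al. [20] as cost and penalty terms.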
It is observed that the labeling result of the pairwise registration
[Figure 4 plots: DC (cortex), top, and DC (ventricle), bottom, over gestational days 150–210 for the ATLAS, PW and GC (P1, P2, P3) segmentations, grouped into age ranges 1–3.]
Figure 4: DCs of automatically estimated labels of the cortex and ventricle at age range 1, 2 and 3.
[Figure 5 panels: DATA, ATLAS, PW, GC and M at GD 171 and GD 203.]
Figure 5: Top: Coronal view - segmentations of the cortex at GD 171 (GW 24 day 3), bottom: sagittal view - segmentations of the ventricle at GD 203 (GW 29). Segmentations are illustrated as estimated by the atlas (ATLAS), after the pairwise registration procedure (PW), estimated by the GC approach (GC) and manual annotations (M).
has an influence on the GC labeling since it acts as is not capable of compensating differences in volume initialization of this procedure, best visible at GD size or absolute displacements. If an estimated 184. The GC refinement is able to compensate the segmentation has a bigger volume than the structure results of the pairwise registration between GD 164 to be segmented or is displaced, then the borders of and 184 and shows an increase of the DC between neighboured tissue prevent the GC approach from atlas and graph-cut based segmentations on average cutting through regions of a high gradient, since by about 0.02. At age range 3 an increase of DC at this would lead to increasing costs in the energy every age range is achievable using GC refinement. minimisation procedure. Consequently, the GC is The highest improvement between atlas-based seg- not capable of refining the segmentation. In Figure 5 mentations and GC based segmentations is reached an example of a misaligned segmentation and its at GD 206 with a DC increase from 0.71 to 0.795. deformation through the labeling procedure is illus- The highest DC at age range 3 of about 0.819 is trated. The displacement is observable at the IHF in achieved at GD 203 and the lowest of about 0.575 at the first column and the superior part of the anterior GD 184. It is observable that pairwise registration horn of the ventricle in the second column. Test data and corresponding estimated segmentations, labeling. Finally, the proposed framework is able to transformed segmentations to subject’s space and estimate cortex segmentations with a DC up to 0.85 GC based segmentations of the cortex at GD 171 and ventricle segmentations up to 0.60. We show (top) and of ventricular tissue at GD 203 (bottom) that image regression is capable of modelling the vari- are shown. The GC segmentations are computed ability of fetal brains in time and is qualified to be using the penalty term 3, since it shows the best used for building a spatio-temporal atlas as basis for improvement between atlas-based and GC based fetal brain tissue segmentation. The evaluation of the segmentations. cortical labeling results for age range 1, 2 and 3 show that a single kernel for pairwise registration for every 4. Conclusion age range is not suitable.
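The evaluation measures used in this section — Dice overlap between automatic and manual labels, tissue volume, and cortical surface area — can be computed along the following lines. This is a sketch only (it assumes a recent scikit-image, and the voxel spacing values are placeholders; the real spacing comes from the MR acquisition):

```python
import numpy as np
from skimage import measure

def dice_coefficient(automatic, manual):
    # DC = 2 |A ∩ M| / (|A| + |M|), cf. Dice [6].
    a, m = automatic.astype(bool), manual.astype(bool)
    denom = a.sum() + m.sum()
    return 2.0 * np.logical_and(a, m).sum() / denom if denom else 1.0

def tissue_volume_mm3(segmentation, spacing=(3.0, 0.9, 0.9)):
    # Voxel count times voxel volume; spacing is (slice, row, col) in mm.
    return segmentation.astype(bool).sum() * np.prod(spacing)

def cortical_surface_area_mm2(segmentation, spacing=(3.0, 0.9, 0.9)):
    # Surface area of the marching-cubes mesh of the binary segmentation.
    verts, faces, _, _ = measure.marching_cubes(
        segmentation.astype(np.float64), level=0.5, spacing=spacing)
    return measure.mesh_surface_area(verts, faces)
```

The DC definition matches Dice [6]: twice the overlap divided by the sum of the two segmentation sizes.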
Thus, a main focus of future work will lie in the improvement of the labeling pro- In this paper an automatic fetal brain tissue label- cedure, by evaluating age range and tissue dependent ing framework using geodesic image regression was regularisation, to improve the quality of graph cut presented and was identified to be suitable as regis- based segmentation. Additionally, a combination of tration approach to longitudinally model the changes global rigid and local deformable pairwise registra- of the brain during the 18th and 30th GW. The advan- tion could be analysed for transforming atlas based tage is the provision of a time-dependent transforma- segmentations to the subject’s space as extension to tion from a source to a target brain, instead of com- this work. bining a template building technique and interpola- tion technique to obtain continuity in time. A novel Acknowledgements longitudinal registration scheme was proposed, using This work was co-funded by ZIT - Life Sciences separate age ranges for flexible regularisation of the 2014, grant number 1207843, Project Flowcluster, deformation behaviour due to the age range depen- and by OeNB (15929). dent changes. The atlas learned was evaluated us- ing a leave-one-out cross validation approach for ev- References ery age range and 21 different regularisation kernels were analysed according to their behaviour regard- [1] J. Ashburner. A fast diffeomorphic image regis- ing volume expansion, modelling of cortical surface tration algorithm. NeuroImage, 38(1):95–113, and Dice similarity to manual annotations. The fe- Oct. 2007. 3, 4 tal brain atlas proposed is not capable of modelling [2] J. Ashburner and K. Friston. Diffeomorphic the thinning of ventricles from age range 1 to age registration using geodesic shooting and Gauss- range 3. Since the proposed method uses one regu- Newton optimisation. NeuroImage, 55(3):954– larisation kernel per age range, geodesic regression 967, Apr. 2011. 3, 4 is not able to regularise location specific volume ex- [3] M. Becker and N. Magnenat-Thalmann. De- pansion and shrinkage at the same time. To overcome formable models in medical image segmenta- this issue, the usage of tissue specific regularisation tion. In N. Magnenat-Thalmann, O. Ratib, and and consequently the computation of separate ven- H. Choi, editors, 3D Multiscale Physiological tricle atlases are a possible solution. In contrast to Human, pages 81–106. Springer London, Jan. this, the increase of the cortical folding grade and of 2014. 1 the volume over time are integrated in the proposed spatio-temporal model. The quality of transformed [4] L. Breysem, H. Bosmans, S. Dymarkowski, atlas based segmentations to subject’s space using D. V. Schoubroeck, I. Witters, J. Deprest, pairwise registration leads to the conclusion that the P. Demaerel, D. Vanbeckevoort, C. Vanhole, kernel for pairwise registration has to be defined dif- P. Casaer, and M. Smet. The value of fast MR ferently according to the age range and also tissue imaging as an adjunct to ultrasound in prenatal type, for being able to improve the graph cut initiali- diagnosis. European Radiology, 13(7):1538– sation term. Additionally, it is shown that the quality 1548, July 2003. 2 of graph cut labeling is dependent on the initialisa- [5] M. Clemence. How to shorten MRI sequences. tion cost term (atlas segmentation) and the penalty In D. Prayer, editor, Fetal MRI, Medical Radiol- term. A false or displaced atlas segmentation hinders ogy, pages 19–32. 
Springer Berlin Heidelberg, as cost term the refinement of the graph cut based 2011. 2 [6] L. Dice. Measures of the amount of ecologic as- Toolkit for Fetal Brain MR Image Process- sociation between species. Ecology, 26(3):297– ing. Computer methods and programs in 302, July 1945. 5 biomedicine, 109(1):65–73, Jan. 2013. 3 [7] S. Durrleman, X. Pennec, A. Trouvé, J. Braga, [15] T. Saul, R. Lewiss, and M. Rivera. Accuracy G. Gerig, and N. Ayache. Toward a comprehen- of emergency physician performed bedside ul- sive framework for the spatiotemporal statisti- trasound in determining gestational age in first cal analysis of longitudinal shape data. Interna- trimester pregnancy. Critical Ultrasound Jour- tional Journal of Computer Vision, 103(1):22– nal, 4(1):1–5, Dec. 2012. 2 59, May 2013. 3 [16] J. Scott, P. Habas, K. Kim, V. Rajagopalan, [8] P. Habas, K. Kim, J. Corbett-Detig, K. Hamzelou, J. Corbett-Detig, A. Barkovich, F. Rousseau, O. Glenn, A. Barkovich, and O. Glenn, and C. Studholme. Growth trajecto- C. Studholme. A spatiotemporal atlas of MR ries of the human fetal brain tissues estimated intensity, tissue probability and shape of the from 3D reconstructed in utero MRI. Interna- fetal brain with application to segmentation. tional Journal of Developmental Neuroscience, NeuroImage, 53(2):460–470, Nov. 2010. 1, 2 29(5):529–536, Aug. 2011. 1 [9] Y. Hong, Y. Shi, M. Styner, M. Sanchez, and [17] A. Serag, P. Aljabar, G. Ball, S. Counsell, M. Niethammer. Simple geodesic regression J. Boardman, M. Rutherford, A. Edwards, for image time-series. In B. Dawant, G. Chris- J. Hajnal, and D. Rueckert. Construction of tensen, J. Fitzpatrick, and D. Rueckert, editors, a consistent high-definition spatio-temporal at- Biomedical Image Registration, number 7359 las of the developing brain using adaptive ker- in Lecture Notes in Computer Science, pages nel regression. NeuroImage, 59(3):2255–2265, 11–20. Springer Berlin Heidelberg, Jan. 2012. Feb. 2012. 1, 2 3 [18] C. Studholme. Mapping fetal brain develop- [10] M. Kuklisova-Murgasova, P. Aljabar, L. Srini- ment in utero using magnetic resonance imag- vasan, S. Counsell, V. Doria, A. Serag, I. Gou- ing: the big bang of brain mapping. Annual sias, J. Boardman, M. Rutherford, A. Edwards, review of biomedical engineering, 13:345–368, J. Hajnal, and D. Rueckert. A dynamic 4D Aug. 2011. 1 probabilistic atlas of the developing brain. Neu- [19] A. Toga and P. Thompson. 1 - an introduction roImage, 54(4):2750–2763, Feb. 2011. 1, 2 to maps and atlases of the brain. In A.W. Toga [11] M. Niethammer, Y. Huang, and F. Vialard. and J.C. Mazziotta, editors, Brain Mapping: Geodesic regression for image time-series. In- The Systems, pages 3–32. Academic Press, San ternational Conference MICCAI 2011, 14(Pt Diego, 2000. 1 2):655–662, 2011. 3 [20] J. Yuan, E. Bae, X. Tai, and Y. Boykov. A [12] D. Pugash, U. Nemec, P. Brugger, and continuous max-flow approach to potts model. D. Prayer. Fetal MRI of Normal Brain Devel- In K. Daniilidis, P. Maragos, and N. Paragios, opment. In D. Prayer, editor, Fetal MRI, Medi- editors, Computer Vision ECCV 2010, num- cal Radiology, pages 147–175. Springer Berlin ber 6316 in Lecture Notes in Computer Sci- Heidelberg, Jan. 2011. 5 ence, pages 379–392. Springer Berlin Heidel- [13] L. Risser, F. Vialard, A. Serag, P. Ajabar, and berg, Jan. 2010. 3, 4 D. Rueckert. Construction of diffeomorphic [21] J. Zhan, I. Dinov, J. Li, Z. Zhang, S. Hobel, spatio-temporal atlases using Krcher means and Y. Shi, X. Lin, A. Zamanyan, L. Feng, G. 
Teng, LDDMM: Application to early cortical devel- F. Fang, Y. Tang, F. Zang, A. Toga, and S. Liu. opment. In Workshop on Image Analysis of Hu- Spatialtemporal atlas of human fetal brain de- man Brain Development (IAHBD), in Interna- velopment during the early second trimester. tional Conference MICCAI 2011, Sept. 2011. NeuroImage, 82:115–126, Nov. 2013. 1, 2 1, 2 [14] F. Rousseau, E. Oubel, J. Pontabry, M. Schweitzer, C. Studholme, M. Koob, and J. Dietemann. BTK: An Open-Source 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 Quantitative Comparison of Feature Matchers Implemented in OpenCV3 Zoltan Pusztai Levente Hajder Eötvös Loránd University MTA SZTAKI Budapest, Hungary Kende u. 13-17. Budapest, Hungary-1111 puzsaai@inf.elte.hu http://web.eee.sztaki.hu Abstract. The latest V3.0 version of the popular The description of the optical flow datasets of Open Computer Vision (OpenCV) framework has just Middlebury database was published in [3]. It was been released in the middle of 2015. The aim of this developed in order to make the optical flow methods paper is to compare the feature trackers implemented comparable. The latest version contains four kinds in the framework. OpenCV contains both feature de- of video sequences: tector, descriptor and matcher algorithms, all possi- ble combinations of those are tried. For the compar- 1. Fluorescent images: Nonrigid motion is taken ison, a structured-light scanner with a turntable was by both color and UV-camera. Dense ground used in order to generate very accurate ground truth truth flow is obtained using hidden fluorescent (GT) tracking data. The tested algorithm on track- texture painted on the scene. The scenes are ing data of four rotating objects are compared. The moved slowly, at each point capturing separate results is quantitatively evaluated as the matched co- test images in visible light, and ground truth im- ordinates can be compared to the GT values. ages with trackable texture in UV light. 2. Synthesized database: Realistic images are gen- 1. INTRODUCTION erated by an image syntheses method. The tracked data can be computed by this system Developing a realistic 3D approach for feature as every parameters of the cameras and the 3D tracker evaluation is very challenging since realisti- scene are known. cally moving 3D objects can simultaneously rotate and translate, moreover, occlusion can also appear in 3. Imagery for Frame Interpolation. GT data is the images. It is not easy to implement a system that computed by interpolating the frames. There- can generate ground truth (GT) data for real-world fore the data is computed by a prediction from 3D objects. The Middlebury database1 is consid- the measured frames. ered as the state-of-the-art GT feature point gener- 4. Stereo Images of Rigid Scenes. Structured light ator. The database itself consists of several datasets scanning is applied first to obtain stereo re- that had been continuously developed since 2002. In construction. (Scharstein and Szeliski 2003). the first period, they generated corresponding feature The optical flow is computed from ground truth points of real-world objects [23]. The first Middle- stereo data. bury dataset can be used for the comparison of fea- ture matchers. 
Later on, this stereo database was ex- The main limitation of the Middlebury optical tended with novel datasets using structured-light [24] flow database is that the objects move approximately or conditional random fields [18]. Even subpixel ac- linearly, there is no rotating object in the datasets. curacy can be achieved in this way as it is discussed This is a very strict limitation as tracking is a chal- in [22]. lenging task mainly when the same texture is seen However, the stereo setup is too strict limitation from different viewpoint. for us, our goal is to obtain tracking data via multiple It is interesting that the Middlebury multi-view frames. database [25] contains ground truth 3D reconstruc- 1http://vision.middlebury.edu/ tion of two objects, however, the ground truth track- ing data were not generated for these sequences. An- • Poster. The last sequence of our dataset is a other limitation of the dataset is that only two low- rotating poster in a page of a motorcycle mag- textured objects are used. azine. It is a relatively easy object for feature It is obvious that tracking data can also be gen- matchers since it is a well-textured plane. The erated by a depth camera [26] such as Microsoft pure efficiency of the trackers can be checked Kinect, but its accuracy is very limited. There are in this example due to two reasons: (i) there is other interesting GT generators for planar objects no occlusion, and (ii) the GT feature tracking is such as the work proposed in [8], however, we would equivalent to the determination of plane-plane like to obtain the tracked feature points of real spatial homographies. objects. Due to these limitations, we decided to build a spe- cial hardware in order to generate ground truth data. Our approach is based on a turntable, a camera, and a projector. They are not too costly, but the whole setup is very accurate as it is shown in our accepted paper [19]. 2. Datasets We have generated four GT datasets as it is pub- lished in our mentioned paper [19]. The feature points are always selected by the tested feature gen- erator method in all frames and then these feature locations are matched between the frames. Then the matched point are filtered: the fundamental ma- trix [9] is robustly computed using 8-point method with RANSAC for every image pair and the outliers are removed from the results. The method imple- mented in the OpenCV framework is used for this robustification. Examples for the moving GT feature points of the generated sets are visualized in Figures 1– 4. Point locations are visualized by light blue dots. The feature matchers are tested in four data se- quences: • Dinosaur. A typical computer vision study deals with the reconstruction of a dinosaurs as it is shown in several scientific papers, e.g [6]. Figure 5. Reconstructed 3D model of testing objects. Top: Plush Dog. Center: Dinosaur. Bottom: Flacon. It has a simple diffuse surface that is easy to re- construct in 3D, hence the feature matching is possible. For this reason, a dino is inserted to our testing dataset. 2.1. GT Data Generation Firstly, the possibilities is overviewed that • Flacon. The plastic holder is another smooth OpenCV can give about feature tracking. These are and diffuse surface. A well-textured label is the currently supported feature detectors in OpenCV fixed on the surface. AGAST [13], AKAZE [17], BRISK [10], FAST [20], • Plush Dog. 
The tracking of the feature point GFTT [28] (Good Features To Track – also known of a soft toy is a challenging task as it does not as Shi-Tomasi corners), KAZE [2], MSER [14], have a flat surface. A plush dog is included into ORB [21]. the testing database that is a real challenge for However, if you compile the contrib(nonfree) feature trackers. repository with the OpenCV, you can also get the Figure 1. GT moving feature points of sequence ’Flacon’. Figure 2. GT moving feature points of sequence ’Poster’. following detectors: SIFT [12], STAR [1], and the matching is started. Every image pair is taken SURF [4]. into consideration, and match each feature point in We use our scanner to take 20 images about a the first image with one in the second image. This rotating object. After each image taken, a struc- means that every feature point in the first image will tured light sequence is projected in order to make have a pair in the second one. However, there can be the reconstruction available for every position. (re- some feature locations in the second image, which constructing only the points in the first image is not has more corresponding feature points in the first enough.) one, but it is also possible that there is no matching Then we start searching for features in these im- point. ages using all feature detectors. After the detection The matching itself is done by calculating the is completed, it is required to extract descriptors. De- minimum distances between the descriptor vectors. scriptors are needed for matching the feature points This distance is defined by the feature tracking in different frames. The following descriptors are method used. The following matchers are available used (each can be found in OpenCV): AKAZE [17], in OpenCV: BRISK [10], KAZE [2], ORB [21]. If one compiles the contrib repository, he/she can also get SIFT [12], • L2 – BruteForce: a brute force minimization al- SURF [4], BRIEF [5], FREAK [16], LATCH [11], gorithm that computes each possible matches. DAISY [27] descriptors 2. The error is the L2 norm of the difference be- Another important issue is the parameterization of tween feature descriptors. the feature trackers. It is obvious that the most ac- • L1 – BruteForce: It is the same as L2 – Brute- curate strategy is to find the best system parameters Force, but L1 norm is used instead of L2 one. for the methods, nevertheless the optimal parameters can differ for each testing video. On the other hand, • Hamming – BruteForce: For binary fea- we think that the authors of the tested methods can ture descriptor (BRISK, BRIEF, FREAK, set the parameters more accurately than us as they LETCH,ORB,AKAZE), the Hamming distance are interested in good performance. For this reason, is used. the default parameter setting is used for each method, and we plan to make the dataset available for every- • Hamming2 – BruteForce: A variant of the ham- one and then the authors themselves can parameter- ming distance is used. The difference between ize their methods. Hamming and Hamming2 is that the former After the detection and the extraction are done, considers every bit as element of the vector, while Hamming2 use integer number, each bit 2The BRIEF descriptor is not invariant to rotation, however, pair forms a number from interval 0 . . . 3 3. we hold it in the set of testing algorithms as it surprisingly served good results. 3OpenCV’s documentation is not very informative about Figure 3. GT moving feature points of sequence ’Dinosaur’. Figure 4. 
GT moving feature points of sequence ’Plush Dog’. • Flann-Based: FLANN (Fast Library for Ap- proximate Nearest Neighbors) is a set of al- gorithms optimized for fast nearest neighbor search in large datasets and for high dimen- sional features [15]. It is needed to point out that one can pair each fea- ture detector with each feature descriptor but each feature matchers is not applicable for every descrip- tor. An exception is thrown by OpenCV if the se- lected algorithms cannot work together. But we try to evaluate every possible selection. The comparison of the feature tracker predictions with the ground truth data is as follows: The feature points are reconstructed first in 3D using the images Figure 6. Error measurement based on simple Euclidean and the structured light. Then, because it is known distances. that the turntable was rotated by 3 degrees per im- ages, the projections of the points are calculated for all the remaining images. These projections were However, this comparison is not good enough be- compared to the matched point locations of the fea- cause if a method fails to match correctly the feature ture trackers and the L points in an image pair, then the feature point moves 2 norm is used to calculate the distances. to an incorrect location in the next image. Therefore, the tracker follows the incorrect location in the re- 3. Evaluation Methodology maining frames and the new matching positions in those images will also be incorrect. The easiest and usual way for comparing the To avoid this effect, a new GT point is generated tracked feature points is to compute the summa at the location of the matched point even if it is an and/or average and/or median of the 2D tracking er- incorrect matching. The GT location of that point rors in each image. This error is defined as the Eu- can be determined in the remaining frames since that clidean distance of the tracked and GT locations. point can be reconstructed in 3D as well using the This methodology is visualized in Fig. 6. structured light scanning, and the novel positions of the new GT point can be determined using the cali- Hamming2 distance. They suggest the usage of that for ORB bration data of the test sequence. features. However, it can be applied for other possible descrip- tors, all possible combinations are tried during our tests. Then the novel matching results are compared to all the previously determined GT points. The ob- is also counted. Furthermore, the average length of tained error values are visualized in Fig. 7. the feature tracks is calculated which shows that in The error of a feature point for the i-th frame is the how many images an average feature point is tracked weighted average of all the errors calculated for that through. feature. For example, there is only one error value for the second frame as the matching error can only 4. Comparison of the methods be compared to the GT location of the feature de- The purpose of this section is to show the main is- tected in the first image. For the third frame, there sues occurred during the testing of the feature match- are two GT locations since GT error generated on ers. Unfortunately, we cannot show to the Reader all both the first (original position) and second (position the charts due to the lack of space. from first matching) image. For the i-th image, i − 1 General remark. The charts in this section show error values are obtained. the error is calculated as different combinations of detectors, descriptors, and the weighted average of those. 
It can be formalized matchers. The method ’xxx:yyy:zzz’ denotes in the as charts that the current method uses the detector ’xxx’, descriptor ’yyy’, and matcher algorithm ’zzz’.
Error_{p_i} = \sum_{n=1}^{i-1}\frac{\|p_i - p'_{i,n}\|_2}{i - n} \quad (1)
4.1. Feature Generation and Filtering using the where Error_{p_i} is the error for the i-th frame, p_i Fundamental Matrix the location of the tested feature detector, while p'_{i,n} The number of the detected feature points is exam- is the GT location of the feature points reconstructed ined first. It is an important property of the matcher from the n-th frame. The weights of the distances are algorithms since many good points are required for a 1/(i − n), which means that older GT points have less typical computer vision application. For example, at weight. Remark that the Euclidean (L2) norm is least hundreds of points are required to compute 3D chosen in order to measure the pixel distances. reconstruction of the observed scene. The matched If a feature point is only detected in one image and filtered values are calculated as the average of and was not being followed in the next one (or was the numbers of generated features for all the frames filtered out in the fundamental-matrix-based filtering as features can be independently generated in each step), then that point is discarded. image of the test sequences. Tables 1–4 show the number of the generated features (left) and that of the filtered ones. There are a few interesting behaviors within the data: • The best images for feature tracking are ob- tained when the poster is rotated. The feature generators give significantly the most points in this case. It is a more challenging task to find good feature points for the rotating dog and di- nosaur. It is because the area of these objects in the images is smaller than that of the other two ones (flacon and poster). • It is clearly seen that the number of SURF feature points is the highest in all test cases after out- Figure 7. Applied error measurement. lier removal. This fact suggests that they will be the more accurate features. After the pixel errors are evaluated for each point in all possible images, the minimum, maximum, • The MSER method gives the most number of summa, average, and median error values of every feature points, however, more than 90% of those feature points are calculated per image. The num- are filtered. Unfortunately, the OpenCV3 li- brary does not contain a sophisticated matcher for ber of tracked feature points in the processed image
Table 1. Average of generated feature points and inliers of Sequence ’Plush Dog’.
Detector  #Features  #Inliers
BRISK  21.7  16.9
FAST  19.65  9.48
GFTT  1000  38.16
KAZE  68.6  40.76
MSER  5321.1  10.56
ORB  42.25  34.12
SIFT  67.7  42.8
STAR  7.15  5.97
SURF  514.05  326.02
AGAST  22.45  11.83
AKAZE  144  101.68
Table 3. Average of generated feature points and inliers of Sequence ’Flacon’.
Detector  #Features  #Inliers
BRISK  219.7  160.99
FAST  387.05  275.4
GFTT  1000  593.4
KAZE  484.1  387.93
MSER  3664.1  31.72
ORB  337.65  287.49
SIFT  348.15  260.91
STAR  69.1  54.86
SURF  952.95  726.83
AGAST  410.15  303.45
AKAZE  655  553.11
Table 2. Average of generated feature points and inliers of Sequence ’Poster’.
Table 4. Average of generated feature points and inliers of Sequence ’Dinosaur’.
4. Comparison of the methods

The purpose of this section is to show the main issues that occurred during the testing of the feature matchers. Unfortunately, we cannot show all the charts to the reader due to the lack of space.

General remark. The charts in this section show different combinations of detectors, descriptors, and matchers. The label 'xxx:yyy:zzz' in the charts denotes that the current method uses detector 'xxx', descriptor 'yyy', and matcher algorithm 'zzz'.

4.1. Feature Generation and Filtering using the Fundamental Matrix

The number of detected feature points is examined first. It is an important property of the matcher algorithms, since many good points are required for a typical computer vision application; for example, at least hundreds of points are required to compute a 3D reconstruction of the observed scene. The matched and filtered values are calculated as the average of the numbers of generated features over all frames, as features can be generated independently in each image of the test sequences. Tables 1–4 show the number of generated features and that of the filtered (inlier) ones.

Table 1. Average of generated feature points and inliers of sequence 'Plush Dog'.
Detector   #Features   #Inliers
BRISK      21.7        16.9
FAST       19.65       9.48
GFTT       1000        38.16
KAZE       68.6        40.76
MSER       5321.1      10.56
ORB        42.25       34.12
SIFT       67.7        42.8
STAR       7.15        5.97
SURF       514.05      326.02
AGAST      22.45       11.83
AKAZE      144         101.68

Table 2. Average of generated feature points and inliers of sequence 'Poster'.
Detector   #Features   #Inliers
BRISK      233.55      188.79
FAST       224.75      139.22
GFTT       956.65      618.75
KAZE       573.45      469.18
MSER       4863.6      40.29
ORB        259.5       230.76
SIFT       413.35      343.08
STAR       41.25       35.22
SURF       1876.95     1577.73
AGAST      275.75      200.25
AKAZE      815         761.4

Table 3. Average of generated feature points and inliers of sequence 'Flacon'.
Detector   #Features   #Inliers
BRISK      219.7       160.99
FAST       387.05      275.4
GFTT       1000        593.4
KAZE       484.1       387.93
MSER       3664.1      31.72
ORB        337.65      287.49
SIFT       348.15      260.91
STAR       69.1        54.86
SURF       952.95      726.83
AGAST      410.15      303.45
AKAZE      655         553.11

Table 4. Average of generated feature points and inliers of sequence 'Dinosaur'.
Detector   #Features   #Inliers
BRISK      21.55       14.8
FAST       51.05       27.01
GFTT       1000        92
KAZE       58.55       33.92
MSER       5144.4      17.86
ORB        67.1        45.87
SIFT       52.8        34.96
STAR       3.45        3.45
SURF       276.95      132.61
AGAST      55          29.86
AKAZE      89.1        59.2

There are a few interesting behaviors within the data:

• The best images for feature tracking are obtained when the poster is rotated: the feature generators give by far the most points in this case. It is a more challenging task to find good feature points for the rotating dog and dinosaur, because the area of these objects in the images is smaller than that of the other two objects (flacon and poster).

• It is clearly seen that the number of SURF feature points is the highest in all test cases after outlier removal. This fact suggests that they will be the most accurate features.

• The MSER method gives the largest number of feature points; however, more than 90% of those are filtered out. Unfortunately, the OpenCV3 library does not contain sophisticated matchers for MSER such as [7], therefore its accuracy is relatively low.

• Note that the GFTT algorithm usually gives 1000 points, as the maximum number was set to one thousand for this method. It is an OpenCV parameter that may be changed, but we did not modify this value.
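The fundamental-matrix-based inlier filtering behind the right-hand columns of the tables can be reproduced with OpenCV's RANSAC estimator; a minimal sketch with placeholder correspondences (the threshold and confidence values are illustrative, not necessarily those used in our tests):

```python
import cv2
import numpy as np

# pts1, pts2 would be the matched point locations from the matching step; here: placeholders
rng = np.random.default_rng(1)
pts1 = (rng.random((200, 2)) * [640, 480]).astype(np.float32)
pts2 = pts1 + rng.normal(scale=0.5, size=pts1.shape).astype(np.float32)

F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
if F is None:
    raise SystemExit("fundamental matrix estimation failed")
inliers = int(mask.sum())  # mask marks correspondences consistent with the epipolar geometry
print("kept", inliers, "of", len(pts1), "matches after epipolar filtering")
```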
4.2. Matching accuracy

Two comparisons were carried out for the feature tracker methods. In the first test, every possible combination of the feature detectors and descriptors is examined, while in the second test the detectors are only combined with their own descriptor.

It is important to note that not only the errors of the feature trackers should be compared; we must also pay attention to the number of features in the images and to the length of the feature tracks. A method with fewer detected features usually obtains better results (a lower error rate) than methods with a higher number of features. The most frequently used chart is the AVG-MED one, where the average and the median of the errors are shown.

Testing of all possible algorithms. As can be seen in Fig. 8 (sequence 'Plush Dog'), the SURF method dominates the chart. With the SURF, DAISY, BRIEF, and BRISK descriptors, more than 300 feature points remained and the median values of the errors are below 2.5 pixels, while the average is around 5 pixels. Moreover, the points are tracked through 4 images on average, which yields pretty impressive statistics for the SURF detector.

Figure 8. Average and median errors of top 10 methods for sequence 'Plush Dog'.

The next test object was the 'Poster'. The results are visualized in Fig. 9. It is interesting to note that if the trackers are sorted by the number of outliers and the top 10 methods are plotted, only the AKAZE detector remains, for which more than 90 percent of the feature points were considered as inliers. Besides the high number of points, the average pixel error is between 3 and 5 pixels, depending on the descriptor and matcher type.

Figure 9. Average and median errors of top 10 methods for sequence 'Poster'.

In the test where the 'Flacon' object was used, we got similar results as in the case of the 'Poster'. Both objects are rich in features, but the 'Flacon' is a spatial object. However, if we look at Fig. 10, where the methods with the 10 lowest median values are plotted, one can see that KAZE and SIFT had more feature points and can track them over more pictures than MSER or SURF after the fundamental filtering. Even though they had the lowest median values, the average errors of these methods were rather high. However, if one takes a look at the methods with the lowest average error, one can observe that AKAZE, KAZE and SURF are present in the top 10. These methods can track more points than the previous ones, and the median errors are just around 2.0 pixels.

Figure 10. Top 10 methods with the lowest median for sequence 'Flacon'. Charts are sorted by median (top) and average (bottom) values.

For the sequence 'Dinosaur' (Figure 11), the test object is very dark, which makes feature detection hard. The number of available points is slightly more than 100. In this case, the overall winner among the methods is SURF, with both the lowest average and median errors. However, GFTT is also present in the last chart.

Figure 11. Top 10 methods (with lowest average error) on sequence 'Dinosaur'.

In the comparisons above, only the detectors were compared against each other. As one can see in the charts, most of the methods used either the DAISY, BRIEF, BRISK or SURF descriptor. From the perspective of the matchers, it does not really matter which type of matcher is used for the same detector-descriptor pair. However, if the descriptor gives a binary vector, then the Hamming distance obviously outperforms the L2 or L1 distances, but there are only slight differences between the L1-L2 and H1-H2 distances.

Testing of algorithms with the same detector and descriptor. In this comparison, only the detectors that have their own descriptor are tested. The best matcher is always selected, for which the error is minimal for the observed detector/descriptor. As can be seen in the log-scale charts in Fig. 12, the median error is almost the same for the AKAZE, KAZE, ORB and SURF trackers, but SURF attains the lowest average value. The tests 'Flacon' and 'Poster' result in the lower pixel errors. On the other hand, the rotation of the 'Dinosaur' was the hardest to track; it resulted in much higher errors for all trackers compared to the other tests.

Figure 12. Overall average (top) and median (bottom) error values for all trackers and test sequences. The detectors and descriptors were the same.

5. Conclusions, Limitations, and Future Work

We quantitatively compared the well-known feature detectors, descriptors, and matchers implemented in OpenCV3 in this study. The GT datasets were generated by a structured-light scanner, and the four test objects were rotated by the turntable of our equipment. It seems clear that the most accurate feature for matching is SURF [4], proposed by Bay et al.; it outperforms the other algorithms in all test cases. The other very accurate algorithms are KAZE [2] and AKAZE [17]; they are the runners-up in our competition.

The most important conclusion for us is that such a comparison is a very hard task: for example, there is an infinite number of possible error metrics, the quality is strongly influenced by the number of features, and so on. The main limitation here is that we can only test the methods on images of rotating objects. We are not sure that the same performance would be obtained if translating objects were observed. A possible extension of this paper is to compare the same methods on the Middlebury database and unify the obtained results for rotation and translation.

We hope that this paper is just the very first step of our research. We plan to generate more testing data, and more algorithms will also be involved in the tests. The GT dataset will be made available online, and an open-source testing system is also planned to be released soon (see http://web.eee.sztaki.hu).

References

[1] M. Agrawal and K. Konolige. CenSurE: Center surround extremas for realtime feature detection and matching. In ECCV, 2008.
[2] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In ECCV (6), pages 214–227, 2012.
[3] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, 2011.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
[5] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In Proceedings of the 11th European Conference on Computer Vision: Part IV, pages 778–792, 2010.
[6] A. W. Fitzgibbon, G. Cross, and A. Zisserman. Automatic 3D model construction for turn-table sequences. In 3D Structure from Multiple Images of Large-Scale Environments, LNCS 1506, pages 155–170, 1998.
[7] P.-E. Forssén and D. G. Lowe. Shape descriptors for maximally stable extremal regions. In ICCV. IEEE, 2007.
[8] S. Gauglitz, T. Höllerer, and M. Turk. Evaluation of interest point detectors and feature descriptors for visual tracking. International Journal of Computer Vision, 94(3):335–360, 2011.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[10] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 2548–2555, 2011.
[11] G. Levi and T. Hassner. LATCH: Learned arrangements of three patch codes. CoRR, 2015.
[12] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, ICCV '99, pages 1150–1157, 1999.
[13] E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger. Adaptive and generic corner detection based on the accelerated segment test. In Proceedings of the 11th European Conference on Computer Vision: Part II, pages 183–196, 2010.
[14] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. BMVC, pages 36.1–36.10, 2002.
[15] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, pages 331–340, 2009.
[16] R. Ortiz. FREAK: Fast retina keypoint. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 510–517, 2012.
[17] P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion for accelerated features in nonlinear scale spaces. In Proceedings of the British Machine Vision Conference. BMVA Press, 2013.
[18] C. J. Pal, J. J. Weinman, L. C. Tran, and D. Scharstein. On learning conditional random fields for stereo - exploring model structures and approximate inference. International Journal of Computer Vision, 99(3):319–337, 2012.
[19] Z. Pusztai and L. Hajder. A turntable-based approach for ground truth tracking data generation. In VISAPP 2016, pages 498–509, 2016.
[20] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In International Conference on Computer Vision, pages 1508–1515, 2005.
[21] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In International Conference on Computer Vision, 2011.
[22] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition - 36th German Conference, GCPR 2014, Münster, Germany, September 2-5, 2014, Proceedings, pages 31–42, 2014.
[23] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002.
[24] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In CVPR (1), pages 195–202, 2003.
[25] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), pages 519–528, 2006.
[26] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012.
[27] E. Tola, V. Lepetit, and P. Fua. DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), 2010.
[28] C. Tomasi and J. Shi. Good features to track. In IEEE Conf. Computer Vision and Pattern Recognition, pages 593–600, 1994.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Real-Time Eye Blink Detection using Facial Landmarks

Tereza Soukupová and Jan Čech
Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University in Prague
{soukuter,cechj}@cmp.felk.cvut.cz

Abstract. A real-time algorithm to detect eye blinks in a video sequence from a standard camera is proposed. Recent landmark detectors, trained on in-the-wild datasets, exhibit excellent robustness against head orientation with respect to the camera, varying illumination and facial expressions. We show that the landmarks are detected precisely enough to reliably estimate the level of the eye opening. The proposed algorithm therefore estimates the landmark positions, extracts a single scalar quantity – eye aspect ratio (EAR) – characterizing the eye opening in each frame. Finally, an SVM classifier detects eye blinks as a pattern of EAR values in a short temporal window. The simple algorithm outperforms the state-of-the-art results on two standard datasets.

Figure 1: Open and closed eyes with landmarks p_i automatically detected by [1]. The eye aspect ratio EAR in Eq. (1) plotted for several frames of a video sequence. A single blink is present.

1. Introduction

Detecting eye blinks is important for instance in systems that monitor a human operator's vigilance, e.g. driver drowsiness [5, 13], in systems that warn a computer user staring at the screen without blinking for a long time, to prevent dry eye and computer vision syndrome [17, 7, 8], in human-computer interfaces that ease communication for disabled people [15], or for anti-spoofing protection in face recognition systems [11].
Existing methods are either active or passive. Active methods are reliable but use special hardware that is often expensive and intrusive, e.g. infrared cameras and illuminators [2], or wearable devices such as glasses with special close-up cameras observing the eyes [10]. The passive systems, in contrast, rely on a standard remote camera only.

Many methods have been proposed to automatically detect eye blinks in a video sequence. Several methods are based on motion estimation in the eye region. Typically, the face and eyes are detected by a Viola-Jones type detector. Next, motion in the eye area is estimated from optical flow, by sparse tracking [7, 8], or by frame-to-frame intensity differencing and adaptive thresholding. Finally, a decision is made whether the eyes are or are not covered by eyelids [9, 15]. A different approach is to infer the state of the eye opening from a single image, e.g. by correlation matching with open and closed eye templates [4], a heuristic horizontal or vertical image intensity projection over the eye region [5, 6], a parametric model fitting to find the eyelids [18], or active shape models [14].

A major drawback of the previous approaches is that they usually implicitly impose too strong requirements on the setup, in the sense of the relative face-camera pose (head orientation), image resolution, illumination, motion dynamics, etc. Especially the heuristic methods that use raw image intensity are likely to be very sensitive despite their real-time performance.

Nowadays, however, robust real-time facial landmark detectors that capture most of the characteristic points on a human face image, including eye corners and eyelids, are available; see Fig. 1. Most of the state-of-the-art landmark detectors formulate a regression problem, where a mapping from an image into landmark positions [16] or into another landmark parametrization [1] is learned. These modern landmark detectors are trained on "in-the-wild" datasets and they are thus robust to varying illumination, various facial expressions, and moderate non-frontal head rotations. The average landmark localization error of a state-of-the-art detector is usually below five percent of the inter-ocular distance. Recent methods even run significantly faster than real-time [12].

Therefore, we propose a simple but efficient algorithm to detect eye blinks using a recent facial landmark detector. A single scalar quantity that reflects the level of the eye opening is derived from the landmarks. Finally, having a per-frame sequence of the eye opening estimates, the eye blinks are found by an SVM classifier that is trained on examples of blinking and non-blinking patterns.

Figure 2: Example of detected blinks. The plots of the eye aspect ratio EAR in Eq. (1), results of the EAR thresholding (threshold set to 0.2), the blinks detected by EAR SVM and the ground-truth labels over the video sequence. Input image with detected landmarks (depicted frame is marked by a red line).

The facial segmentation model presented in [14] is similar to the proposed method. However, their system is based on active shape models with a reported processing time of about 5 seconds per frame for the segmentation, and the eye opening signal is normalized by statistics estimated by observing a longer sequence. The system is thus usable for offline processing only. The proposed algorithm runs in real time, since the extra costs of computing the eye opening from the landmarks and of the linear SVM are negligible.

The contributions of the paper are:

1. The ability of two state-of-the-art landmark detectors [1, 16] to reliably distinguish between the open and closed eye states is quantitatively demonstrated on a challenging in-the-wild dataset and for various face image resolutions.

2. A novel real-time eye blink detection algorithm which integrates a landmark detector and a classifier is proposed.
The evaluation is done on two standard datasets [11, 8], achieving state-of-the-art results.

The rest of the paper is structured as follows: the algorithm is detailed in Sec. 2, experimental validation and evaluation are presented in Sec. 3, and Sec. 4 concludes the paper.

2. Proposed method

An eye blink is a fast closing and reopening of a human eye. Each individual has a slightly different pattern of blinks. The pattern differs in the speed of closing and opening, the degree of squeezing the eye, and the blink duration. An eye blink lasts approximately 100-400 ms.

We propose to exploit state-of-the-art facial landmark detectors to localize the eyes and eyelid contours. From the landmarks detected in the image, we derive the eye aspect ratio (EAR) that is used as an estimate of the eye opening state. Since the per-frame EAR may not necessarily recognize the eye blinks correctly, a classifier that takes a larger temporal window of a frame into account is trained.

2.1. Description of features

For every video frame, the eye landmarks are detected. The eye aspect ratio (EAR) between the height and the width of the eye is computed as

EAR = (\|p_2 − p_6\| + \|p_3 − p_5\|) / (2 \|p_1 − p_4\|),   (1)

where p_1, . . . , p_6 are the 2D landmark locations depicted in Fig. 1.

The EAR is mostly constant when an eye is open and gets close to zero while the eye is closing. It is partially person- and head-pose-insensitive. The aspect ratio of the open eye has a small variance among individuals, and it is fully invariant to a uniform scaling of the image and to an in-plane rotation of the face. Since eye blinking is performed by both eyes synchronously, the EAR of both eyes is averaged. An example of an EAR signal over a video sequence is shown in Figs. 1, 2 and 7.

A similar feature to measure the eye opening was suggested in [9], but it was derived from the eye segmentation in a binary image.
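A direct transcription of Eq. (1) in NumPy; the landmark ordering follows Fig. 1 and the sample coordinates are invented:

```python
import numpy as np

def eye_aspect_ratio(p):
    """EAR of one eye (Eq. 1); p is a (6, 2) array of landmarks p1..p6 ordered as in Fig. 1."""
    a = np.linalg.norm(p[1] - p[5])   # ||p2 - p6||
    b = np.linalg.norm(p[2] - p[4])   # ||p3 - p5||
    c = np.linalg.norm(p[0] - p[3])   # ||p1 - p4||
    return (a + b) / (2.0 * c)

# made-up landmark coordinates of a roughly open eye
left_eye = np.array([[0, 0], [2, -1], [4, -1], [6, 0], [4, 1], [2, 1]], dtype=float)
print(eye_aspect_ratio(left_eye))     # the EARs of both eyes would be averaged in practice
```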
2.2. Classification

It generally does not hold that a low value of the EAR means that a person is blinking. A low value of the EAR may occur when a subject closes his/her eyes intentionally for a longer time or performs a facial expression, yawning, etc., or when the EAR captures a short random fluctuation of the landmarks.

Therefore, we propose a classifier that takes a larger temporal window of a frame as an input. For the 30 fps videos, we experimentally found that ±6 frames can have a significant impact on blink detection for a frame where an eye is the most closed when blinking. Thus, for each frame, a 13-dimensional feature is gathered by concatenating the EARs of its ±6 neighboring frames.

This is implemented by a linear SVM classifier (called EAR SVM) trained from manually annotated sequences. Positive examples are collected as ground-truth blinks, while the negatives are sampled from parts of the videos where no blink occurs, with a 5-frame spacing and a 7-frame margin from the ground-truth blinks. When testing, the classifier is executed in a scanning-window fashion: a 13-dimensional feature is computed and classified by the EAR SVM for each frame except the beginning and the end of a video sequence.

3. Experiments

Two types of experiments were carried out: experiments that measure the accuracy of the landmark detectors, see Sec. 3.1, and experiments that evaluate the performance of the whole eye blink detection algorithm, see Sec. 3.2.

3.1. Accuracy of landmark detectors

To evaluate the accuracy of the tested landmark detectors, we used the 300-VW dataset [19]. It is a dataset containing 50 videos where each frame has an associated precise annotation of facial landmarks. The videos are "in-the-wild", mostly recorded from TV.

The purpose of the following tests is to demonstrate that recent landmark detectors are particularly robust and precise in detecting eyes, i.e. the eye corners and the contour of the eyelids. Therefore we prepared a dataset, a subset of the 300-VW, containing sample images with both open and closed eyes. More precisely, having the ground-truth landmark annotation, we sorted the frames for each subject by the eye aspect ratio (EAR in Eq. (1)) and took 10 frames of the highest ratio (eyes wide open), 10 frames of the lowest ratio (mostly eyes tightly shut) and 10 frames sampled randomly. Thus we collected 1500 images. Moreover, all the images were later subsampled (successively 10 times by factor 0.75) in order to evaluate the accuracy of the tested detectors on small face images.

Two state-of-the-art landmark detectors were tested: Chehra [1] and Intraface [16]. Both run in real time (Intraface runs at 50 Hz on a standard laptop). Samples from the dataset are shown in Fig. 3. Notice that the faces are not always frontal to the camera, the expression is not always neutral, people are often emotionally speaking or smiling, etc. Sometimes people wear glasses, and hair may occasionally partially occlude one of the eyes. Both detectors perform generally well, but Intraface is more robust to very small face images, sometimes to an impressive extent, as shown in Fig. 3.

Figure 3: Example images from the 300-VW dataset with landmarks obtained by Chehra [1] and Intraface [16]. Original images (left) with inter-ocular distance (IOD) equal to 63 (top) and 53 (bottom) pixels. Images subsampled (right) to IOD equal to 6.3 (top) and 17 (bottom).

Quantitatively, the accuracy of the landmark detection for a face image is measured by the average relative landmark localization error, defined as usual as

ε = (100 / (κ N)) \sum_{i=1}^{N} \|x_i − x̂_i\|_2,   (2)

where x_i is the ground-truth location of landmark i in the image, x̂_i is the landmark location estimated by a detector, N is the number of landmarks, and the normalization factor κ is the inter-ocular distance (IOD), i.e. the Euclidean distance between the eye centers in the image.
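Eq. (2) translates into a few lines of NumPy; the toy arrays below are placeholders, and κ is passed in explicitly since which landmarks define the eye centers depends on the annotation scheme:

```python
import numpy as np

def landmark_error(x_gt, x_est, iod):
    """Average relative landmark localization error of Eq. (2), in percent of the IOD.
    x_gt, x_est: (N, 2) arrays of annotated and detected landmarks; iod: kappa."""
    per_landmark = np.linalg.norm(x_gt - x_est, axis=1)
    return 100.0 * per_landmark.mean() / iod

rng = np.random.default_rng(0)
gt = rng.random((49, 2)) * 200            # 49 made-up landmark positions
est = gt + rng.normal(scale=2.0, size=gt.shape)
print(landmark_error(gt, est, iod=80.0))  # error in percent of the inter-ocular distance
```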
First, a standard cumulative histogram of the average relative landmark localization error was calculated, see Fig. 4, for the complete set of 49 landmarks and also for the subset of the 12 landmarks of the eyes only, since these landmarks are used in the proposed eye blink detector. The results are calculated for all the original images, which have an average IOD of about 80 px, and also for all "small" face images (including the subsampled ones) having IOD ≤ 50 px. For all landmarks, Chehra has more occurrences of very small errors (up to 5 percent of the IOD), but Intraface is more robust, having more occurrences of errors below 10 percent of the IOD. For the eye landmarks only, Intraface is always more precise than Chehra. As already mentioned, Intraface is much more robust to small images than Chehra. This behaviour is further observed in the following experiment.

Figure 4: Cumulative histogram of the average localization error of all 49 landmarks (top) and of the 12 landmarks of the eyes (bottom). The histograms are computed for original resolution images (solid lines) and for a subset of small images (IOD ≤ 50 px).

Taking the set of all 15k images, we measured the mean localization error µ as a function of the face image resolution determined by the IOD. More precisely, µ = (1/|S|) \sum_{j∈S} ε_j, i.e. the average error over the set S of face images having the IOD in a given range. The results are shown in Fig. 5. The plots have error bars of standard deviation. It is seen that Chehra fails quickly for images with IOD < 20 px. For larger faces, the mean error is comparable, although slightly better for Intraface for the eye landmarks.

Figure 5: Landmark localization accuracy as a function of the face image resolution, computed for all landmarks and for eye landmarks only.

The last test is directly related to the eye blink detector. We measured the accuracy of the EAR as a function of the IOD. The mean EAR error is defined as the mean absolute difference between the true and the estimated EAR. The plots are computed for two subsets: closed/closing eyes (average true ratio 0.05 ± 0.05) and open eyes (average true ratio 0.4 ± 0.1). The error is higher for closed eyes. The reason is probably that both detectors are more likely to output open eyes in case of a failure. It is seen that the ratio error for IOD < 20 px causes a major confusion between open/closed eye states for Chehra; nevertheless, for larger faces the ratio is estimated precisely enough to ensure reliable eye blink detection.

Figure 6: Accuracy of the eye-opening ratio as a function of the face image resolution. Top: images with a small true ratio (mostly closing/closed eyes); bottom: images with a higher ratio (open eyes).

3.2. Eye blink detector evaluation

We evaluate on two standard databases with ground-truth annotations of blinks. The first one is ZJU [11], consisting of 80 short videos of 20 subjects. Each subject has 4 videos: 2 with and 2 without glasses; 3 videos are frontal and 1 is an upward view. The 30 fps videos are of size 320 × 240 px. The average video length is 136 frames, and a video contains about 3.6 blinks on average. The average IOD is 57.4 pixels. In this database, subjects do not perform any noticeable facial expressions. They look straight into the camera at close distance, almost do not move, and neither smile nor speak. A ground-truth blink is defined by its beginning frame, peak frame and ending frame.

The second database, Eyeblink8 [8], is more challenging. It consists of 8 long videos of 4 subjects that are smiling, rotating the head naturally, covering the face with hands, yawning, drinking and looking down, probably at a keyboard. These videos have lengths from 5k to 11k frames, are also 30 fps, with a resolution of 640 × 480 pixels and an average IOD of 62.9 pixels. They contain about 50 blinks on average per video. Each frame belonging to a blink is annotated with the half-open or closed state of the eyes. We consider half blinks, which do not reach the closed state, as full blinks in order to be consistent with the ZJU.
Besides testing the proposed EAR SVM methods, which are trained to detect the specific blink pattern, we compare with a simple baseline method which only thresholds the EAR values of Eq. (1). The EAR SVM classifiers are tested with both landmark detectors, Chehra [1] and Intraface [16].

The experiments with the EAR SVM are done in a cross-dataset fashion: the SVM classifier is trained on Eyeblink8 and tested on the ZJU, and vice versa.

To evaluate detector accuracy, the predicted blinks are compared with the ground-truth blinks. The number of true positives is determined as the number of ground-truth blinks which have a non-empty intersection with the detected blinks. The number of false negatives is counted as the number of ground-truth blinks which do not intersect any detected blink. The number of false positives is equal to the number of detected blinks minus the number of true positives, plus a penalty for detecting too long blinks. The penalty is counted only for detected blinks more than twice as long as an average blink of length A: every such long blink of length L is counted L/A times as a false positive. The number of all possibly detectable blinks is computed as the number of frames of a video sequence divided by the subject's average blink length, following Drutarovsky and Fogelton [8].
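A sketch of this evaluation protocol as we read it (intervals are frame ranges; the handling of the long-blink penalty is our interpretation of the text, not the authors' code):

```python
def evaluate_blinks(gt_blinks, det_blinks, avg_blink_len):
    """Count TP/FP/FN between ground-truth and detected blinks.
    Blinks are (first_frame, last_frame) tuples; avg_blink_len is A in the text."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    tp = sum(1 for g in gt_blinks if any(overlaps(g, d) for d in det_blinks))
    fn = len(gt_blinks) - tp

    fp = len(det_blinks) - tp
    for d in det_blinks:
        length = d[1] - d[0] + 1
        if length > 2 * avg_blink_len:        # penalty only for overly long detections
            fp += length / avg_blink_len      # such a blink is counted L/A times as a false positive
    return tp, fp, fn

print(evaluate_blinks([(10, 15), (40, 46)], [(11, 14), (60, 95)], avg_blink_len=7))
```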
This con- of true positives is determined as a number of the firms our previous study on the accuracy of land- ground-truth blinks which have a non-empty inter- marks in Sec. 3.1. 100 1 A B 90 0.9 C 0.8 80 0.7 70 0.6 60 0.5 AUC 50 0.4 0.3 40 Precision [%] 0.2 30 0.1 Chehra SVM Intraface SVM 20 0 57.38 51.6 45.9 40.2 34.4 28.7 23.0 17.2 11.5 5.7 EAR Thresholding IOD [px] 10 Chehra SVM Intraface SVM 0 Figure 9: Accuracy of the eye blink detector (mea- 0 10 20 30 40 50 60 70 80 90 100 Recall [%] sured by AUC) as a function of the image resolution (average IOD) when subsampling the ZJU dataset. (a) ZJU 100 phenomena as non-frontality, bad illumination, facial 90 expressions, etc. 80 A State-of-the-art on two standard datasets was 70 achieved using the robust landmark detector fol- lowed by a simple eye blink detection based on the 60 SVM. The algorithm runs in real-time, since the ad- 50 ditional computational costs for the eye blink detec- 40 Precision [%] tion are negligible besides the real-time landmark de- tectors. 30 The proposed SVM method that uses a temporal 20 window of the eye aspect ratio (EAR), outperforms EAR Thresholding 10 the EAR thresholding. On the other hand, the thresh- Chehra SVM Intraface SVM olding is usable as a single image classifier to detect 0 0 10 20 30 40 50 60 70 80 90 100 the eye state, in case that a longer sequence is not Recall [%] available. (b) Eyeblink8 We see a limitation that a fixed blink duration for all subjects was assumed, although everyone’s blink Figure 8: Precision-recall curves of the EAR thresh- lasts differently. The results could be improved by an olding and EAR SVM classifiers measured on (a) the adaptive approach. Another limitation is in the eye ZJU and (b) the Eyeblink8 databases. Published re- opening estimate. While EAR is estimated from a 2D sults of methods A - Drutarovsky and Fogelton [8], B image, it is fairly insensitive to a head orientation, but - Lee et al. [9], C - Danisman et al. [5] are depicted. may lose discriminability for out of plane rotations. A solution might be to define the EAR in 3D. There 4. Conclusion are landmark detectors that estimate a 3D pose (po- sition and orientation) of a 3D model of landmarks, A real-time eye blink detection algorithm was e.g. [1, 3]. presented. We quantitatively demonstrated that regression-based facial landmark detectors are pre- Acknowledgment cise enough to reliably estimate a level of eye open- ness. While they are robust to low image quality (low The research was supported by CTU student grant image resolution in a large extent) and in-the-wild SGS15/155/OHK3/2T/13. References [16] X. Xiong and F. De la Torre. Supervised descent methods and its applications to face alignment. In [1] A. Asthana, S. Zafeoriou, S. Cheng, and M. Pantic. Proc. CVPR, 2013. 2, 3, 4, 5 Incremental face alignment in the wild. In Confer- [17] Z. Yan, L. Hu, H. Chen, and F. Lu. Computer vision ence on Computer Vision and Pattern Recognition, syndrome: A widely spreading but largely unknown 2014. 1, 2, 3, 4, 5, 7 epidemic among computer users. Computers in Hu- [2] L. M. Bergasa, J. Nuevo, M. A. Sotelo, and man Behaviour, (24):2026–2042, 2008. 1 M. Vazquez. Real-time system for monitoring driver [18] F. Yang, X. Yu, J. Huang, P. Yang, and D. Metaxas. vigilance. In IEEE Intelligent Vehicles Symposium, Robust eyelid tracking for fatigue detection. In ICIP, 2004. 1 2012. 1 [3] J. Cech, V. Franc, and J. Matas. A 3D approach to facial landmarks: Detection, refinement, and track- [19] S. Zafeiriou, G. 
[4] M. Chau and M. Betke. Real time eye tracking and blink detection with USB cameras. Technical Report 2005-12, Boston University Computer Science, May 2005.
[5] T. Danisman, I. Bilasco, C. Djeraba, and N. Ihaddadene. Drowsy driver detection system using eye blink patterns. In Machine and Web Intelligence (ICMWI), Oct 2010.
[6] H. Dinh, E. Jovanov, and R. Adhami. Eye blink detection using intensity vertical projection. In International Multi-Conference on Engineering and Technological Innovation, IMETI 2012.
[7] M. Divjak and H. Bischof. Eye blink based fatigue detection for prevention of computer vision syndrome. In IAPR Conference on Machine Vision Applications, 2009.
[8] T. Drutarovsky and A. Fogelton. Eye blink detection using variance of motion vectors. In Computer Vision - ECCV Workshops, 2014.
[9] W. H. Lee, E. C. Lee, and K. E. Park. Blink detection robust to various facial poses. Journal of Neuroscience Methods, Nov. 2010.
[10] Medicton group. The system I4Control. http://www.i4tracking.cz/.
[11] G. Pan, L. Sun, Z. Wu, and S. Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, 2007.
[12] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proc. CVPR, 2014.
[13] A. Sahayadhas, K. Sundaraj, and M. Murugappan. Detecting driver drowsiness based on sensors: A review. MDPI open access: sensors, 2012.
[14] F. M. Sukno, S.-K. Pavani, C. Butakoff, and A. F. Frangi. Automatic assessment of eye blinking patterns through statistical shape models. In ICVS, 2009.
[15] D. Torricelli, M. Goffredo, S. Conforto, and M. Schmid. An adaptive blink detector to initialize and update a view-based remote eye gaze tracking system in a natural scenario. Pattern Recognition Letters, 30(12):1144–1150, Sept. 2009.
[16] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proc. CVPR, 2013.
[17] Z. Yan, L. Hu, H. Chen, and F. Lu. Computer vision syndrome: A widely spreading but largely unknown epidemic among computer users. Computers in Human Behaviour, (24):2026–2042, 2008.
[18] F. Yang, X. Yu, J. Huang, P. Yang, and D. Metaxas. Robust eyelid tracking for fatigue detection. In ICIP, 2012.
[19] S. Zafeiriou, G. Tzimiropoulos, and M. Pantic. The 300 videos in the wild (300-VW) facial landmark tracking in-the-wild challenge. In ICCV Workshop, 2015. http://ibug.doc.ic.ac.uk/resources/300-VW/.

21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Solving Dense Image Matching in Real-Time using Discrete-Continuous Optimization

Alexander Shekhovtsov, Christian Reinbacher, Gottfried Graber and Thomas Pock
Institute for Computer Graphics and Vision, Graz University of Technology
{shekhovtsov,reinbacher,graber,pock}@icg.tugraz.at

Abstract. Dense image matching is a fundamental low-level problem in Computer Vision, which has received tremendous attention from both discrete and continuous optimization communities. The goal of this paper is to combine the advantages of discrete and continuous optimization in a coherent framework. We devise a model based on energy minimization, to be optimized by both discrete and continuous algorithms in a consistent way. In the discrete setting, we propose a novel optimization algorithm that can be massively parallelized. In the continuous setting we tackle the problem of non-convex regularizers by a formulation based on differences of convex functions. The resulting hybrid discrete-continuous algorithm can be efficiently accelerated by modern GPUs and we demonstrate its real-time performance for the applications of dense stereo matching and optical flow.
1. Introduction

The dense image matching problem is one of the most basic problems in computer vision: the goal is to find matching pixels in two (or more) images. The applications include stereo, optical flow, medical image registration, face recognition [1], etc. Since the matching problem is inherently ill-posed, optimization is typically involved in solving it. We can distinguish two fundamentally different approaches: discrete and continuous optimization. Whereas discrete approaches (see [14] for a recent comparison) assign a distinct label to each output pixel, continuous approaches solve for a function using the calculus of variations [6, 8, 21]. Both approaches have received enormous attention, and there exist state-of-the-art algorithms in both camps: continuous [23, 24, 28] and discrete [18, 30]. Due to the specific mathematical tools available to solve the problems (discrete combinatorial optimization vs. continuous calculus of variations), both approaches have distinct advantages and disadvantages.

Figure 1: Optical flow problem solved by a purely discrete method, a purely continuous method and the combined method. All methods are as described in this paper; they use the same data term and are run until convergence here. In the discrete solution we can see small-scale details and sharp motion boundaries, but also discretization artifacts. The continuous solution exhibits sub-pixel accuracy (smoothness), but lacks small details and has difficulties with large motions. The combined solution delivers smooth flow fields while retaining many small-scale details. Summary table from Figure 1:

             data term              Large motion   Parallelization
Discrete     Arbitrary (sampled)    Easy           Difficult
Continuous   Convex (linearized)    Difficult      Easy

In this paper, we argue that on a fundamental level the advantages and disadvantages of discrete and continuous optimization for dense matching problems are complementary, as summarized in Figure 1. Previous work combining discrete and continuous optimization primarily used discrete optimization to fuse (find the optimal crossover of) candidate continuous proposals, e.g. [36, 30] (stereo) and [25] (flow). The latter additionally performs local continuous optimization of the so-found solution. Many works also alternate between continuous and discrete optimization, addressing a Mumford-Shah-like model, e.g. [5]. Similarly to [25] we introduce a continuous energy which is optimized using a combined method. However, we work with a full (non-local) discretization of this model and propose new parallel optimization methods.

The basic difference between discrete and continuous approaches lies in the handling of the data term. The data term is a measure of how well the solution (i.e. the value of a pixel) fits the underlying measurement (i.e. the input images).
In the discrete setting, the solution takes discrete labels, and hence the number of labels is finite. Typically the data cost is precomputed for all possible labels. The discrete optimization then uses the data cost to find the optimal label for each pixel according to a suitable model in an energy minimization framework. We point out that, due to the sampling in both label space and spatial domain, the discrete algorithm has access to the full information at every step; i.e. it deals with a global optimization model and in some lucky cases can find a globally optimal solution to it, or provide an approximation ratio or partial optimality guarantees [27].

In the continuous setting, the solution is a continuous function. This means it is not possible to precompute the data cost; an infinite number of solutions would require an infinite amount of memory. More importantly, the data cost is a non-convex function stemming from the similarity measure between the images. In order to make the optimization problem tractable, a popular approach is the linearization of the data cost. However, this introduces a range of new problems, namely the inability to deal with large motions, due to the fact that the linearization is valid only in a small neighborhood around the linearization point. Most continuous methods relying on linearization therefore use a coarse-to-fine framework in an attempt to overcome this problem [4]. One exception is a recent work [16], which can handle piecewise linear data terms and truncated TV regularization.

Our goal in this paper is to combine the advantages of both approaches, as well as real-time performance, which imposes tough constraints on both methods, resulting in a number of challenges.

Challenges. The discrete optimization method needs to be highly parallel and able to couple the noisy / ambiguous data over large areas. The continuous energy should be a refinement of the discrete energy, so that we can evaluate the two-phase optimization in terms of a single energy function. The continuous method needs to handle robust (truncated) regularization terms.

Contribution. Towards the posed challenges, we propose: i) a new method for the discrete problem, working in the dual (i.e. making equivalent changes of the data cost volume), in parallel on multiple chains; ii) a continuous optimization method, reducing non-convex regularizers to a primal-dual method with non-linear operators [31]; iii) an efficient implementation of both methods on GPU and proof-of-concept experiments showing the advantages of the combined approach.

2. Method

In this section we describe our two-step approach to the dense image matching problem. To combine the previously discussed advantages of discrete and continuous optimization methods, it is essential to minimize the same energy in both optimization methods. Starting from a continuous energy formulation in § 2.1, we first show how to discretize the energy in § 2.2 and subsequently minimize it using a novel discrete parallel block coordinate descent, described in § 2.3. The output of this algorithm is the input to a refinement method which is posed as a continuous optimization problem, solved by a non-linear primal-dual algorithm described in § 2.4.

2.1. Model

Let us formally define the dense image matching problem to be addressed by the discrete-continuous optimization approach. In both formulations we consider the image domain to be a discrete set of pixels V. The continuous formulation has continuous-ranged variables u = (u_i^k ∈ R | k = 1, . . . , d, i ∈ V), where d = 1, 2 for stereo / flow, respectively. The matching problem is formulated as

min_{u ∈ U} [ E(u) = D(u) + R(Au) ],   (1)

where U = R^{d×V}, D is the data term and R(Au) is a regularizer (A is a linear operator explained below). The discrete formulation will quantize the variable ranges.

Data Term. We assume D(u) = \sum_{i∈V} D_i(u_i), where D_i : R^d → R encodes the deviation of u_i from some underlying measurement. A usual choice for dense image matching are robust filters like the Census Transform or Normalized Cross Correlation, computed on a small window around a pixel. This data term is non-convex in u and piecewise linear. In the discrete setting, the data term is sampled at discrete locations; in the continuous setting, the data term is convexified by linearizing or approximating it around the current solution. The details will be described in the respective sections.
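To make the data term concrete, here is a small NumPy sketch of a census-transform matching cost for a single disparity label; the window size, the wrap-around border handling via np.roll, and the toy image pair are our own illustrative choices, not necessarily those of this paper:

```python
import numpy as np

def census(img, radius=2):
    """Census transform: per pixel, a bit vector comparing each neighbour with the centre."""
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            bits.append((shifted < img).astype(np.uint8))
    return np.stack(bits, axis=-1)                      # (h, w, num_bits)

def census_cost(c1, c2, disparity):
    """Hamming distance between census strings of I1(x) and I2(x - disparity)."""
    shifted = np.roll(c2, disparity, axis=1)
    return np.sum(c1 != shifted, axis=-1)               # (h, w) cost slice for this label

img1 = np.random.rand(60, 80).astype(np.float32)
img2 = np.roll(img1, -3, axis=1)                        # toy pair with a constant 3 px disparity
c1, c2 = census(img1), census(img2)
print(census_cost(c1, c2, 3).mean(), census_cost(c1, c2, 0).mean())  # cost is much lower at the true shift
```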
Regularization Term. The regularizer encodes properties of the solution of the energy minimization, like local smoothness or preservation of sharp edges. The choice of this term is crucial in practice, since the data term may be unreliable or uninformative in large areas of dense matching problems. We assume (truncated) regularization terms of the form

R(Au) = \sum_{ij∈E} \sum_{k=1}^{d} ω_{ij} r((Au^k)_{ij}),   (2)

where E ⊂ V × V is the set of edges, i.e. pairs of neighboring pixels; the linear operator A : R^V → R^E : u^k ↦ (u_i^k − u_j^k ∈ R | ∀ij ∈ E) essentially computes gradients along the edges in E for the solution dimension k; the gradients are penalized by the penalty function r : R → R; and ω_{ij} are image-dependent per-edge strength weights, reducing the penalty around sharp edges. Our particular choice for the penalty function r is depicted in Fig. 2. We chose to use a truncated norm, which has been shown to be robust against the noise one typically encounters in dense matching problems. It generalizes truncated Total Variation in the continuous setting. In the discrete setting it generalizes the P1-P2 penalty model [11], the Potts model and the truncated linear model.

Figure 2: Regularizer function r. In our continuous optimization method it is decomposed into a difference of convex functions r+ − r−. For the discrete optimization it is sampled at label locations depicted as dots.
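For intuition, the truncated penalty and its difference-of-convex split can be evaluated directly from the closed forms given later in Eqs. (19)-(20); a tiny sketch with arbitrary parameter values:

```python
import numpy as np

def r_ab(t, alpha, beta):
    """Piecewise-linear penalty r_{alpha,beta} of Eq. (20); convex for alpha <= 1."""
    t = np.abs(t)
    return np.where(t <= beta, alpha * t, t - beta * (1.0 - alpha))

def r_truncated(t, eps, delta, C):
    """Robust regularizer of Fig. 2 via Eq. (19): r = r_{eps,delta} - r_{0, C+delta-eps*delta}."""
    return r_ab(t, eps, delta) - r_ab(t, 0.0, C + delta - eps * delta)

t = np.linspace(-4, 4, 9)
print(r_truncated(t, eps=0.1, delta=1.0, C=2.0))   # saturates at C for large |t| (truncation)
```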
Stereo. We discretize a range of disparities and let u(x) ∈ R^V denote the continuous solution corresponding to the labeling x. We set f_i(x_i) = D_i(u(x_i)) and f_{ij}(x_i, x_j) = ω_{ij} r((Au(x))_{ij}).

Flow. Discretization of the flow is somewhat more challenging. Since u_i is a 2D vector, assuming large displacements, discretizing all combinations is not tractable. Instead, the components u_i^1 and u_i^2 can be represented as separate discrete variables x_{i1}, x_{i2}, where (i1, i2) is a pair of nodes duplicating i, leading to the decomposed formulation [26]. To retain the pairwise energy form (3), this approach assigns the data terms D_i(u_i) to a pairwise cost f_{i1 i2}(x_{i1}, x_{i2}), and the regularization is imposed on each layer of variables (x_{i1} | i ∈ V) and (x_{i2} | i ∈ V) separately. To this end, we tested a yet simpler representation, in which we assign optimistic data costs, given by

f_{i1}(x_{i1}) = min_{x_{i2}} D_i(x_{i1}, x_{i2}),   (5a)
f_{i2}(x_{i2}) = min_{x_{i1}} D_i(x_{i1}, x_{i2}),   (5b)

where D_i(x_{i1}, x_{i2}) is the discretized data cost, and regularize each layer individually. This makes the two layers fully decouple into, essentially, two independent stereo-like problems. At the same time, the coupled scheme [26], passing messages between the two layers, differs merely in recomputing (5) for reparametrized data costs in a loop. Our simplification is thus not a principled limitation but an intermediate step.

2.2. Discrete Formulation

In the discrete representation we use the following formalism. To a continuous variable u_i we associate a discrete variable x_i ∈ L. The discrete label space L can be chosen to our convenience as long as it has the desired number of elements, denoted K. We let L be the set of vectors in {0, 1}^K with exactly one component equal to 1 (the 1-hot encoding of the natural numbers from 1 to K). For f_i ∈ R^K we denote f_i(x_i) = ⟨f_i, x_i⟩ = f_i^T x_i, and for f_{ij} ∈ R^{K×K} we denote f_{ij}(x_i, x_j) = x_i^T f_{ij} x_j. Let f = (f_w | w ∈ V ∪ E) denote the energy cost vector. The energy function corresponding to the cost vector f is given by

f(x) = \sum_{i∈V} f_i(x_i) + \sum_{ij∈E} f_{ij}(x_i, x_j).   (3)

Whenever we need to refer to f as a function and not as the cost vector, we always use the argument notation; e.g. f(x) ≥ g(x) is different from f ≥ g.

An energy function f that can be written as \sum_i f_i(x_i) = ⟨f, x⟩ is called modular, separable or linear; formally, all components f_{ij} of f are identically zero. If f_{ij} is non-zero only for a subgraph of (V, E) which is a set of chains, we say that f is a chain.

The discrete energy minimization problem is defined as

min_{x ∈ L^V} f(x).   (4)

2.3. Discrete Optimization

In this section we give an overview of a new method under development, addressing problem (4) through its LP-relaxation dual. In real-time applications like stereo and flow there seems to be a demand for methods performing fast approximate discrete optimization, preferably well-parallelizable, which has motivated significant research. The challenge may be phrased as "best solution in a limited time budget".

Well-performing methods, from local to global, range from cost volume filtering [12], semi-global matching (SGM) [11] (which has been implemented on GPU and FPGA [2]), dynamic programming on spanning trees adjusting the cost volume [3] and more-global matching (MGM) [10], to the sequential dual block coordinate methods such as TRW-S [15]. Despite being called sequential, TRW-S exhibits a fair amount of parallelism in its computation dependency graph, which is exploited in parallel GPU/FPGA implementations [7, 13]. At the same time, SGM has been interpreted [9] as a single step of the parallel TRW algorithm [32] developed for solving the dual. MGM goes further in this direction, resembling even more the structure of a dual solver: it combines together more messages, but in a heuristic fashion and introducing more computation dependencies, in fact similar to TRW-S. It appears that all these approaches go somehow in the direction of a fast processing of the dual.

We propose a new dual update scheme which: i) is a monotonous block-coordinate ascent; ii) performs as well as TRW-S for an equal number of iterations while having a comparable iteration cost; and iii) offers more parallelism, mapping better to current massively parallel compute architectures. It thus bridges the gap between highly parallel heuristics and the best "sequential" dual methods without compromising on speed and performance.

On a higher level, the method is most easily presented in the dual decomposition framework. For clarity, let us consider a decomposition into two subproblems only (horizontal and vertical chains). Consider minimizing the energy function E(x) that separates as

E(x) = f(x) + g(x),   (6)

where f, g : L^V → R are chains.
Primal Majorize-Minimize. Even before introducing the dual, we can propose applying the majorize-minimize method (a well-known optimization technique) to the primal problem in the form (6). It is instructive for the subsequent presentation of the dual method and has an intriguing connection to it, which we do not yet fully understand.

Definition 2.1. A modular function \bar{f} is a majorant (upper bound) of f if (∀x) \bar{f}(x) ≥ f(x), symbolically \bar{f} ⪰ f. A modular minorant \underline{f} of f is defined similarly (\underline{f} reads "f underbar").

Noting that minimizing a chain function plus a modular function is easy, one could straightforwardly propose Algorithm 1, which alternates between majorizing one of f or g by a modular function and minimizing the resulting chain problem \bar{f} + g (resp. f + \bar{g}). We are not aware of this approach having been evaluated before. Somewhat novel is that the sum of two chain functions is employed rather than, say, a difference of submodular functions [19], but the principle is the same.

Algorithm 1: Primal MM
  Input: initial primal point x^k
  Output: new primal point x^{k+2}
  1: \bar{f} ⪰ f, \bar{f}(x^k) = f(x^k)              /* majorize */
  2: x^{k+1} ∈ argmin_x (\bar{f} + g)(x)              /* minimize */
  3: \bar{g} ⪰ g, \bar{g}(x^{k+1}) = g(x^{k+1})       /* majorize */
  4: x^{k+2} ∈ argmin_x (f + \bar{g})(x)              /* minimize */

To ensure monotonicity of the algorithm we need to pick a majorant \bar{f} of f which is exact in the current primal solution x^k, as in Line 1. Then f(x^{k+1}) + g(x^{k+1}) ≤ \bar{f}(x^{k+1}) + g(x^{k+1}) ≤ \bar{f}(x^k) + g(x^k) = f(x^k) + g(x^k). Steps 3-4 are completely similar. Algorithm 1 has the following properties:
• it is primal monotonous;
• it is parallel, since, e.g., min_x (\bar{f} + g)(x) decouples over all vertical chains;
• it uses more information about the subproblem f than just the optimal solution (as in most primal block-coordinate schemes: ICM, alternating lines, etc.).

The performance of this method highly depends on the strategy of choosing majorants. This will also be the main question to address in the dual setting.
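The recurring subroutine here, exact minimization of a chain energy plus a modular (unary) term, is plain dynamic programming; a generic sketch with made-up costs (not the paper's GPU implementation):

```python
import numpy as np

def minimize_chain(unary, pairwise):
    """Exact minimizer of sum_i unary[i][x_i] + sum_i pairwise[i][x_i, x_{i+1}]
    over one chain by dynamic programming (Viterbi).
    unary: (n, K) array; pairwise: (n-1, K, K) array."""
    n, K = unary.shape
    best = unary[0].copy()                 # best cost of a path ending at node 0 with each label
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        cand = best[:, None] + pairwise[i - 1]     # (K, K): previous label x next label
        back[i] = np.argmin(cand, axis=0)
        best = cand[back[i], np.arange(K)] + unary[i]
    x = np.empty(n, dtype=int)
    x[-1] = int(np.argmin(best))
    for i in range(n - 1, 0, -1):                  # backtrack the optimal labeling
        x[i - 1] = back[i, x[i]]
    return x, float(best.min())

rng = np.random.default_rng(0)
unary, pairwise = rng.random((5, 4)), rng.random((4, 4, 4))
print(minimize_chain(unary, pairwise))
```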
Dual Decomposition. Minimization of (6) can be written as

min_{x^1 = x^2} f(x^1) + g(x^2).   (7)

Introducing a vector of Lagrange multipliers λ ∈ R^{L×V} for the constraint x^1 = x^2, we get the Lagrange dual problem

max_λ [ min_x (f(x) + ⟨λ, x⟩) + min_x (g(x) − ⟨λ, x⟩) ],   (8)

where the two inner minimization problems are denoted D1(λ) and D2(λ), respectively. The so-called slave problems D1(λ) and D2(λ) have the form of minimizing an energy function with a data cost modified by λ. The goal of the master problem (8) is to balance the data cost between the slave problems such that their solutions agree. The slave problems are minima of finitely many functions linear in λ; the objective of the master problem (8), D(λ) = D1(λ) + D2(λ), is thus a concave piecewise linear function, and (8) is a concave maximization problem. However, since x takes values in a discrete space, there is only weak duality: (7) ≥ (8). It is known that (8) can be written as a linear program (LP), which is as difficult in terms of computational complexity as a general LP [22].

Dual Minorize-Maximize. In the dual, which is a maximization problem, we speak of a minorize-maximize method. The setting is similar to the primal one. We can efficiently maximize D1, D2 but not D1 + D2. Suppose we have an initial dual point λ^0 and let x^0 ∈ argmin_x (f + λ^0)(x) be a solution to the slave subproblem D1, that is, D1(λ^0) = f(x^0) + λ^0(x^0).

Proposition 2.2. Let \underline{f} be a modular minorant of f exact in x^0 and such that \underline{f} + λ^0 ≥ D1(λ^0) (component-wise). Then the function \underline{D}_1(λ) = min_x (\underline{f} + λ)(x) is a minorant of D1(λ) exact at λ = λ^0.

Proof. Since \underline{f}(x) ≤ f(x) for all x, it follows that min_x (\underline{f} + λ)(x) ≤ min_x (f + λ)(x) for all λ, and therefore \underline{D}_1 is a minorant of D1. Next, on one hand we have \underline{D}_1(λ^0) ≤ D1(λ^0) and on the other, D1(λ^0) ≤ (\underline{f} + λ^0)(x) for all x, and thus D1(λ^0) ≤ \underline{D}_1(λ^0).

We have constructed a minorant of D1 which is itself a (simple) piecewise linear concave function. The maximization step of the minorize-maximize is to solve

max_λ (\underline{D}_1(λ) + D2(λ)).   (9)

Proposition 2.3. λ* = −\underline{f} is a solution to (9).

Proof. Substituting λ* into the objective (9) we obtain \underline{D}_1(λ*) + D2(λ*) = min_x (\underline{f} − \underline{f})(x) + D2(−\underline{f}) = min_x (\underline{f} + g)(x). This value is the maximum because \underline{D}_1(λ) + D2(λ) = min_x (\underline{f} + λ)(x) + min_x (g − λ)(x) ≤ min_x (\underline{f} + λ + g − λ)(x) = min_x (\underline{f} + g)(x).

Note that for the dual point λ = −\underline{f}, in order to construct a minorant of D2 (similarly to Proposition 2.2), we need to find a solution to the second slave problem,

x^1 ∈ argmin_x (g − λ)(x) = argmin_x (\underline{f} + g)(x).   (10)

We obtain Algorithm 2:

Algorithm 2: Dual MM
  Input: initial dual point \underline{g}^k
  Output: new dual point \underline{g}^{k+2}
  1: x^k ∈ argmin_x (f + \underline{g}^k)(x)   /* minimize */
  2: \underline{f}^{k+1} ⪯ f, \underline{f}^{k+1}(x^k) = f(x^k), \underline{f}^{k+1} + \underline{g}^k ≥ f(x^k) + \underline{g}^k(x^k)   /* minorize */
  3: x^{k+1} ∈ argmin_x (\underline{f}^{k+1} + g)(x)   /* minimize */
  4: \underline{g}^{k+2} ⪯ g, \underline{g}^{k+2}(x^{k+1}) = g(x^{k+1}), \underline{f}^{k+1} + \underline{g}^{k+2} ≥ \underline{f}^{k+1}(x^{k+1}) + g(x^{k+1})   /* minorize */

Algorithm 2 has the following properties:
• it builds the sequence of dual points given by λ^{2t} = \underline{g}^{2t}, λ^{2t+1} = −\underline{f}^{2t+1}, and the dual objective does not decrease at each step;
• the minimization subproblems and the minorants are decoupled (can be solved in parallel) for all horizontal (resp. vertical) chains;
• when provided with good minorants (see below), the algorithm has the same fixed points as TRW-S [15];
• updating only a single component λ_i for a pixel i is a monotonous step as well; therefore the algorithm is a parallel block-coordinate ascent.

Notice also that Dual MM and Primal MM are very similar, nearly up to replacing minorants with majorants. The sequence {E(x^k)}_k is monotonous in Algorithm 1 but not in Algorithm 2.

Good and Fast Minorants. The choice of the minorant in Dual MM is non-trivial, as there are many, which makes it sort of a secret ingredient. Figure 3 illustrates two of the possible choices. The naive minorant for a chain problem f + λ is constructed by calculating its min-marginals and dividing by the chain length to ensure that the simultaneous step is monotonous (cf. the tree block update algorithm of Sontag and Jaakkola [29, Fig. 1]). The uniform minorant is found through an optimization procedure that tries to build the tightest modular lower bound, by uniformly increasing all components that are not yet tight. The details are given in §A. In practice, we build fast minorants, which try to approximate the uniform one using fast message passing operations. Parallelization over the decoupled chains allowed us to achieve an implementation which, while having the same number of memory accesses as TRW-S (including messages / dual variables), saturates the GPU memory bandwidth of ∼230 GB/s (about 10 times faster than reported for the FPGA implementation [7] of TRW-S). This allows us to perform 5 iterations of Algorithm 2 for a 512×512 image with 64 labels at a rate of about 30 fps.

Figure 3: Lower bounds and best primal solutions by TRW-S and by Dual MM with a naive and a uniform minorant. The problem is a small crop from stereo of size 40×40, 16 labels, truncated linear regularization. On the x-axis, one iteration is a forward-backward pass of TRW-S vs. one iteration of Dual MM (equal number of updates per pixel). With a good choice of minorant, Dual MM can perform even better than the sequential baseline in terms of iterations. Parallelizing it can be expected to give a direct speedup.

2.4. Continuous Refinement

In this section we describe the continuous refinement method, which is based on variational energy minimization. The goal of this step is to refine the output of the optimization method described in § 2.3, which is discrete in label space.

To that end, it is important to minimize the same energy in both formulations. Considering the optimization problem in (1), we are seeking to minimize a non-convex, truncated norm together with a non-convex data term. For clarity, let us write down the problem again:

min_{u ∈ U} D(u) + R(Au).   (11)

Non-Convex Primal-Dual. Efficient algorithms exist to solve (11) in case both D(u) and R(Au) are convex (but possibly non-smooth), e.g. the primal-dual solver of Chambolle and Pock [6]. Kolmogorov et al. [16] solve (11) for a truncated total variation regularizer using a splitting into horizontal and vertical 1D problems and applying [6] to the Lagrangian function. Here we use a recently proposed extension of [6] by Valkonen [31]. He considers problems of the form min_x G(x) + F(A(x)), i.e. of the same structure as (11), where G and F are convex, G is differentiable and A(u) is a twice differentiable but possibly non-linear operator. In the primal-dual formulation, the problem is written as

min_x max_y G(x) + ⟨A(x), y⟩ − F*(y),   (12)

where * denotes the convex conjugate. Valkonen proposes the following modified primal-dual hybrid gradient method:

x^{k+1} = (I + τ ∂G)^{-1}(x^k − τ ∇A(x^k)^T y^k),   (13a)
y^{k+1} = (I + σ ∂F*)^{-1}(y^k + σ A(2x^{k+1} − x^k)).   (13b)
tree block update to solve (11) in case both D(u) and R(Au) are convex algorithm of Sontag and Jaakkola [29, Fig. 1]). The uni- (but possibly non-smooth), e.g. the primal-dual solver of form minorant is found through the optimization proce- Chambolle and Pock [6]. Kolmogorov et al. [16] solves dure that tries to build the tightest modular lower bound, (11) for a truncated total variation regularizer using a split-by increasing uniformly all components that are not yet ting into horizontal and vertical 1D problems and applying tight. The details are given in §A. In practice, we build [6] to the Lagrangian function. Here we will use a recently fast minorants, which try to approximate the uniform one proposed extension to [6] by Valkonen [31]. He considers using fast message passing operations. Parallelization of problems of the form minx G(x) + F(A(x)), i.e. of the decoupled chains allowed us to achieve an implementa- same structure as (11), where G and F are convex, G is tion which, while having the same number of memory ac- differentiable and A(u) is a twice differentiable but pos- cesses as TRW-S (including messages / dual variables), sibly non-linear operator. In the primal-dual formulation, saturates the GPU memory bandwidth, ∼ 230GB/s.2 This the problem is written as 2This is about 10 times faster than reported for FPGA implementa- min max G(x) + hA(x), yi − F∗(y), (12) tion [7] of TRW-S. x y where ∗ is the convex conjugate. Valkonen proposes the To compute the proximal map (I + σ∂F∗)−1(ˆ y) we following modified primal-dual hybrid gradient method: first need the convex conjugate of ωijrα,β(t). It is given by (ωijrα,β)∗(t∗) = xk+1 =(I + τ ∂G)−1(xk − τ ∇A(xk)T yk) (13a) (max(0, β|t∗| − ωijαβ) if α < |t∗| < ωij yk+1 =(I + σ∂F∗)−1(yk + σA(2xk+1 − xk)). (13b) . (21) ∞ else Reformulation In order to apply method [31], we will The proximal map for (ωijrα,β)∗ at t∗ ∈ R is given by reformulate the non-convex problem (11) to the form (12). ¯ t = clamp(±ωij, t0), where clamp(±ωij, ·) denotes a We start by formulating the regularizer R(Au) as a differ- clamping to the interval [−ωij, ωij] and ence of convex functions: R(Au) = R+(Au) − R−(Au), ( where R+ and R− are convex. The primal-dual formula- t∗ if |t∗| ≤ αωij t0 = tion of (11) then reads (22) max(αωij, |t∗|−βσ) sign(t∗) else. h min max(hAu, pi − R∗ Proximal map (I + σ∂ +(p)) (14) F∗)−1(ˆy) is calculated by applying u p expression (22) component-wise to ˆ y. The proximal map i + max(hAu, qi − R∗−(q)) + D(u) . (I + τ ∂G)−1 depends on the choice of the data term D(u) q and will thus be defined in § 3. Because minx −f(x) = − maxx f(x), (14) equals 3. Applications h min max(hAu, pi − R∗+(p))+ (15) 3.1. Stereo Reconstruction u p i + min(−hAu, qi + R∗ For the problem of estimating depth from two images, −(q)) + D(u) . q we look at a setup of two calibrated and synchronized cameras. We assume that the input images to our method Grouping terms we arrive at have been rectified according to the calibration parameters h i of the cameras. We aim to minimize the energy (1) where min max hAu, p−qi−R∗+(p)+R∗−(q)+D(u) . (16) u,q p u encodes the disparity in x-direction. The data term mea- sures the data fidelity between images I1 and I2, warped The problem now arises in minimizing the bilinear term by the disparity field u. As a data term we use the Census hAu, qi in (16) in both u and q. We thus move this term Transform [37] computed on a small local patch in each into the nonlinear operator A(x) and rewrite (16) as image. 
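Returning to the continuous refinement of § 2.4, the structure of iteration (13) and of the component-wise proximal map (22) can be sketched as follows. The operator A, its Jacobian transpose and the resolvent of G are kept abstract here; the function names, calling conventions and step-size handling are assumptions for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def nonlinear_pdhg(x0, y0, A, gradA_T, prox_G, prox_Fstar, tau, sigma, iters=100):
    """Schematic form of the modified primal-dual hybrid gradient iteration (13):
         x^{k+1} = (I + tau dG)^{-1}   ( x^k - tau * (grad A(x^k))^T y^k )     (13a)
         y^{k+1} = (I + sigma dF*)^{-1}( y^k + sigma * A(2 x^{k+1} - x^k) )    (13b)
       where A may be a non-linear (but twice differentiable) operator."""
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x_new = prox_G(x - tau * gradA_T(x, y), tau)
        y = prox_Fstar(y + sigma * A(2.0 * x_new - x), sigma)
        x = x_new
    return x, y

def prox_conj_truncated(t_star, omega, alpha, beta, sigma):
    """Component-wise proximal map of (omega_ij * r_{alpha,beta})^* following
       Eq. (22): shrink towards the kink at alpha*omega, then clamp the result
       to the interval [-omega, omega]."""
    a = np.abs(t_star)
    t0 = np.where(a <= alpha * omega,
                  t_star,
                  np.maximum(alpha * omega, a - beta * sigma) * np.sign(t_star))
    return np.clip(t0, -omega, omega)
```

In the stereo and flow applications, prox_Fstar is applied per dual component with the per-edge weights ω_ij as in (22), while prox_G is the data-term proximal map defined in § 3.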
The cost is given by the pixel-wise Hamming dis- tance on the transformed images.D(u) is non-convex in * + Au p the argument u which makes the optimization problem in min max , + R∗−(q) + D(u) u,q p,d=1 −hAu, qi d (1) intractable in general. | {z } |{z} | {z } | {z } G(x) We start by minimizing (1) using the discrete method x y A(x) (§2.3) in order to obtain an initial solution ˚ u. We approx- − R∗+(p) (17) imate the data term around the current point ˚ u by a piece- | {z } wise linear convex function ˜ D(u) = F∗(y) ( by introducing a dummy variable d = 1. s D(˚ u) + δ 1(u − ˚ u) if u ≤ ˚ u [˚ u−h,˚ u+h](u) + (23) Implementation Details The gradient of A needed by s2(u − ˚ u) otherwise iterates (13) is given by with s1 = D(˚ u+h)−D(˚ u) and s for a h 2 = D(˚ u)−D(˚ u+h) h A 0 small h. To ensure convexity, we set s if ∇A(x) = . 1 = s2 = s1+s2 (18) 2 −ATq −Au s2 < s1. The indicator function δ is added to ensure that the solution stays within ˚ u ± h where the approximation The regularization function r is represented as a difference is valid. We then apply the continuous method (§2.4). The of two convex functions (see Figure 2): proximal map ¯ u = (I + τ ∂G)−1(û) needed by the algo- rithm (13) for the approximated data term expresses as the r(t) = rε,δ(t) − r0,(C+δ−εδ)(t), (19) pointwise soft-thresholding where  τs   1,i if ûi > ˚ ui + τ s1,i (  α|t| if |t| ≤ β ¯ u ˚ u  r i = clamp i ± h, ûi − τs2,i if ûi < ˚ ui + τ s2,i α,β (t) = (20)   |t| − β(1 − α) else  0 otherwise is convex for α ≤ 1. Convex functions R+(Au) and In practice, the minimization has to be embedded in a R−(Au) are defined by decomposition (19) and (2). warping framework: after optimizing for n iterations, the data term is approximated anew at the current solution u. 3.2. Optical Flow (a) Input (b) Groundtruth The optical flow problem for two images I1, I2 is posed again as model (1). In contrast to stereo estimation, we now have u 2 i ∈ R encoding the flow vector. For the discrete optimization step (§2.3) the flow problem is decoupled into two independent stereo-like problems as dis- cussed in §2.2. (c) TV regularization (d) Proposed Method For the continuous refinement step, the main prob- lem is again the non-convexity of the data term. In- stead of a convex approximation with two linear slopes we build a quadratic approximation, now in 2D, follow- ing [34]. The approximated data term reads ˜ Di(ui) = δ[˚ ui−h,˚ ui+h](ui)+ 1 Figure 4: Influence of the robust regularizer in the continuous Di(˚ ui) + LT(u (u i i − ˚ ui) + refinement on stereo reconstruction quality. 2 i − ˚ ui)TQi(ui − ˚ ui), (24) (a) Refinement (b) No Refinement where L 2 2×2 i ∈ R and Qi ∈ R are finite difference ap- proximations of the gradient and the Hessian with step- size h. Convexity of (24) is ensured by retaining only positive-semidefinite part of Qi as in [34]. The proximal map ¯ u = (I + τ ∂G)−1(û) for data term (24) is given point-wise by ûk + τ (Q ¯ uk = clamp ˚ uk i i˚ ui − Li)k . Figure 5: Influence of continuous refinement on the reconstruc- i i ± h, (25) 1 + τ Lki tion quality of KinectFusion. Optimizing (1) is then performed as proposed in §2.4. For the purpose of this experiment we replace the Kinect 4. Experiments with a Point Grey Bumblebee2 stereo camera. KinectFu- sion can only handle relatively small camera movements 4.1. Stereo Reconstruction between images, so a high framerate is essential. 
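The Census-based data cost of § 3.1 can be prototyped along the following lines: a binary census descriptor over a small window for each pixel, and a per-disparity pixel-wise Hamming distance between the left image and the shifted right image. Window radius, border handling and the dense cost-volume layout are illustrative choices only; the real-time system evaluates this cost on the GPU and is not reproduced here.

```python
import numpy as np

def census_transform(img, radius=2):
    """Census transform: one boolean descriptor per pixel, with one bit per
       neighbour in a (2r+1)x(2r+1) window, set when the neighbour is darker
       than the centre pixel."""
    H, W = img.shape
    pad = np.pad(img, radius, mode='edge')
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            nb = pad[radius + dy: radius + dy + H, radius + dx: radius + dx + W]
            bits.append(nb < img)
    return np.stack(bits, axis=-1)          # H x W x (window size - 1), boolean

def census_cost_volume(left, right, num_disp):
    """Data cost: pixel-wise Hamming distance between census descriptors of the
       left image and the disparity-shifted right image, one slice per disparity."""
    cl, cr = census_transform(left), census_transform(right)
    H, W, _ = cl.shape
    cost = np.zeros((num_disp, H, W), dtype=np.float32)
    for d in range(num_disp):
        shifted = np.empty_like(cr)
        shifted[:, d:] = cr[:, :W - d]
        shifted[:, :d] = cr[:, :1]          # crude border handling
        cost[d] = (cl != shifted).sum(-1)   # Hamming distance
    return cost
```

The slice cost[d] plays the role of the unary data cost at disparity d, which the discrete stage of § 2.3 minimizes and which (23) approximates around the current solution for the continuous refinement.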
We set We evaluate our proposed real-time stereo method on the parameters to our method to achieve a compromise datasets where Ground-Truth data is available as well as between highest quality and a framerate of ≈ 4 − 5 fps: on images captured using a commercially available stereo camera resolution 640 × 480, 128 disparities, 4 iterations camera. of Dual MM, 5 warps and 40 iterations per warp of the continuous refinement. Influence of Continuous Refinement The first stage 4.1.1 Influence of Truncated Regularizer of our reconstruction method, Dual MM, already delivers We begin by comparing the proposed method to a sim- high quality disparity images that include details on fine plified version that does not use a truncated norm as reg- structures and depth discontinuities that are nicely aligned ularizer but a standard Total Variation. We show the ef- with edges in the image. In this experiment we want to fect of this change in Fig. 4, where one can observe much show the influence of the second stage, the continuous re-sharper edges, when using a robust norm in the regulariza- finement, on the reconstruction quality of KinectFusion. tion term. On the downside it is more sensitive to outliers, To that end we mount the camera on a tripod and collect which however can be removed in a post-processing step 300 depthmaps live from our full method and 300 frames like a two-side consistency check. with the continuous refinement switched off. By switch- ing off the camera tracking, the final reconstruction will 4.1.2 Live Dense Reconstruction show us the artifacts produced by the stereo method. Fig- ure 5 depicts the result of this comparison. One can easily To show the performance of our stereo matching method see that the output of the discrete method contains fine de- in a real live setting, we look at the task of creating a tails, but suffers from staircasing artifacts on slanted sur- live dense reconstruction from a set of depth images. To faces due to the integer solution. The increase in qual- that end, we are using a reimplementation of KinectFusion ity due to the refinement stage can be especially seen on proposed by Newcombe et al. [20] together with the out- far away objects, where a disparity step of 1 pixel is not put of our method. This method was originally designed enough to capture smooth surfaces. to be used with the RGBD output of a Microsoft Kinect Timing To show the influence of the individual steps and tracks the 6 DOF position of the camera in real-time. in our stereo method on runtime, we break down the total Cost Vol. Discrete Cont. Ref. Total Inputs 27 ms 73 ms 39 ms 139 ms Table 1: Runtime analysis of the individual components of our stereo matching method. Details regarding computing hardware Werlberger [33] Combined and parameters are in the text. In case of the full left-right check procedure the total computation time doubles. (a) Input (b) Reconstruction Figure 6: Qualitative result of reconstructing a desktop scene using KinectFusion3. time of ≈ 140 ms per frame in Table 1. Those timings have been achieved using a PC with 32 GB RAM with a NVidia 980GTX, running Linux. Qualitative Results To give an impression about the quality of the generated depthmaps and the speed of our method, we run our full algorithm and aim to reconstruct a desktop scene with a size of 1 × 1 × 1 meters and show some renderings in Fig. 6. To better visualize the quality Figure 7: Subjective comparison of variational approach [33] of the geometry, the model is rendered without texture3. 
(left) with our combined method (right). Top row show input 4.2. Optical Flow images, one from a pair. Both methods use the same data term. Parameters of both algorithms have been tuned by hand to de- In this section we show preliminary results of our liver good results. Note that for [33] it is often impossible to get algorithm applied to optical flow. A further improve- sharp motion boundaries as well as small scale details, despite a ment in quality can be expected by exploiting the coupled very strong data term (e.g. artifacts in left image, first row). scheme [26] in the discrete optimization, as discussed in § 2.2. As depicted in Figure 7, our method is able to de-tion is sufficiently localized, continuous representation in- liver reasonable results on a variety of input images. We creases the accuracy of the model as well as optimization deliberately chose scenes that contain large motion as well speed. In the continuous optimization, we experimented as small scale objects, to highlight the strengths of the with non-convex models and showed a reduction allowing discrete-continuous approach. For comparison, we use a to handle them with the help of a recent non-linear primal- state-of-the-art purely continuous variational optical flow dual method. This in turn allowed to speak of a global algorithm [33]. The runtime of our method is 2s for an model to be solved by a discrete-continuous optimization. image of size 640 × 480. Ideally, we would like to achieve a method, which, when given enough time, produces an accurate solution, 5. Conclusion and in the real time setting gives a robust result. We plan further to improve on the model. A vast literature on the The current results demonstrate that it is feasible to topic suggest that modeling occlusions and using planar solve dense image matching problems using global op- hypothesis can be very helpful. At the same time, we are timization methods with a good quality in real time. We interested in a tighter coupling of discrete and continuous have proposed a highly parallel discrete method, which optimization towards a globally optimal solution. even when executed sequentially, is competitive with the best sequential methods. As a dual method, we believe, Acknowledgements it has a potential to smoothly handle more complex mod- els in the dual decomposition framework and is in theory This work was supported by the research initiative applicable to general graphical models. When the solu- Mobile Vision with funding from the AIT and the Aus- 3 trian Federal Ministry of Science, Research and Economy We point the interested reader to a video that shows the reconstruc- tion pipeline in real-time: http://gpu4vision.icg.tugraz. HRSM programme (BGBl. II Nr. 292/2012). at/videos/cvww16.mp4 References [17] Lawler, E. (1966). Optimal cycles in doubly weighted di- rected linear graphs. In Intl Symp. Theory of Graphs. [1] Arashloo, S. R. and Kittler, J. (2014). Fast pose invariant face recognition using super coupled multiresolution Markov [18] Menze, M., Heipke, C., and Geiger, A. (2015). Discrete random fields on a GPU. Pattern Recognition Letters, 48. optimization for optical flow. In GCPR. [2] Banz, C., Hesselbarth, S., Flatt, H., Blume, H., and Pirsch, [19] Narasimhan, M. and Bilmes, J. (2005). A supermodular- P. (2010). Real-time stereo vision system using semi-global submodular procedure with applications to discriminative matching disparity estimation: Architecture and FPGA- structure learning. In Uncertainty in Artificial Intelligence. 
implementation. In ICSAMOS. [20] Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., [3] Bleyer, M. and Gelautz, M. (2008). Simple but effective tree Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., structures for dynamic programming-based stereo matching. and Fitzgibbon, A. (2011). Kinectfusion: Real-time dense In VISAPP. surface mapping and tracking. In ISMAR. [4] Brox, T., Bruhn, A., Papenberg, N., and Weickert, J. (2004). [21] Ochs, P., Chen, Y., Brox, T., and Pock, T. (2014). ip- High accuracy optical flow estimation based on a theory for iano: Inertial proximal algorithm for non-convex optimiza- warping. In ECCV. tion. SIAM JIS, 7(2). [5] Brox, T., Bruhn, A., and Weickert, J. (2006). Variational [22] Prusa, D. and Werner, T. (2015). Universality of the local motion segmentation with level sets. In ECCV, volume 3951. marginal polytope. PAMI, 37(4). [6] Chambolle, A. and Pock, T. (2011). A first-order primal- [23] Ranftl, R., Bredies, K., and Pock, T. (2014). Non-local dual algorithm for convex problems with applications to total generalized variation for optical flow estimation. In imaging. Journal of Mathematical Imaging and Vision, 40(1). ECCV. [7] Choi, J. and Rutenbar, R. A. (2012). Hardware implementa- [24] Ranftl, R., Gehrig, S., Pock, T., and Bischof, H. (2012). tion of MRF MAP inference on an FPGA platform. In Field Pushing the limits of stereo using variational stereo estima- Programmable Logic. tion. In Intelligent Vehicles Symposium. [8] Combettes, P. L. and Pesquet, J.-C. (2011). Proximal split- [25] Roth, S., Lempitsky, V., and Rother, C. (2009). Discrete- ting methods in signal processing. In Fixed-Point Algorithms continuous optimization for optical flow estimation. In Statis- for Inverse Problems in Science and Engineering. tical and Geometrical Approaches to Visual Motion Analysis, volume 5604. [9] Drory, A., Haubold, C., Avidan, S., and Hamprecht, F. (2014). Semi-global matching: A principled derivation in [26] Shekhovtsov, A., Kovtun, I., and Hlaváč, V. (2008). Effi- terms of message passing. In Pattern Recognition, volume cient MRF deformation model for non-rigid image matching. 8753. CVIU, 112. [10] Facciolo, G., de Franchis, C., and Meinhardt, E. (2015). [27] Shekhovtsov, A., Swoboda, P., and Savchynskyy, B. MGM: A significantly more global matching for stereovision. (2015). Maximum persistency via iterative relaxed inference In BMVC. with graphical models. In CVPR. [11] Hirschmuller, H. (2011). Semi-global matching- [28] Sinha, S. N., Scharstein, D., and Szeliski, R. (2014). motivation, developments and applications. Efficient high-resolution stereo matching using local plane sweeps. In CVPR. [12] Hosni, A., Rhemann, C., Bleyer, M., Rother, C., and Gelautz, M. (2013). Fast cost-volume filtering for visual cor- [29] Sontag, D. and Jaakkola, T. S. (2009). Tree block coordi- respondence and beyond. PAMI, 35(2). nate descent for MAP in graphical models. In AISTATS. [13] Hurkat, S., Choi, J., Nurvitadhi, E., Martınez, J. F., and [30] Taniai, T., Matsushita, Y., and Naemura, T. (2014). Graph Rutenbar, R. A. (2012). Fast hierarchical implementation of cut based continuous stereo matching using locally shared la- sequential tree-reweighted belief propagation for probabilis- bels. In CVPR. tic inference. In Field Programmable Logic. [31] Valkonen, T. (2014). A primal-dual hybrid gradient method [14] Kappes, J. H., Andres, B., Hamprecht, F. A., Schnörr, C., for nonlinear operators with applications to MRI. 
Inverse Nowozin, S., Batra, D., Kim, S., Kausler, B. X., Lellmann, J., Problems, 30(5). Komodakis, N., and Rother, C. (2013). A comparative study of modern inference techniques for discrete energy minimiza- [32] Wainwright, M., Jaakkola, T., and Willsky, A. (2005). tion problem. In CVPR. MAP estimation via agreement on (hyper)trees: Message- passing and linear-programming approaches. IEEE Trans- [15] Kolmogorov, V. (2006). Convergent tree-reweighted mes- actions on Information Theory, 51(11). sage passing for energy minimization. PAMI, 28(10). [33] Werlberger, M. (2012). Convex Approaches for High Per- [16] Kolmogorov, V., Pock, T., and Rolinek, M. (2015). Total formance Video Processing. PhD thesis, Institute for Com- variation on a tree. CoRR, abs/1502.07770. puter Graphics and Vision, Graz University of Technology, Graz, Austria. [34] Werlberger, M., Pock, T., and Bischof, H. (2010). Motion If i and j are two nodes in a chain f +λ then performing estimation with non-local total variation regularization. In the update of λi changes the min-marginal at j and vice- CVPR. versa. The updates must be implemented sequentially or otherwise one gets a non-monotonous behavior and the [35] Werner, T. (2007). A linear programming approach to max- sum problem: A review. PAMI, 29(7). method may fail to converge (see [15]). TRW-S gains its efficiency in that after the update (31), [36] Woodford, O., Torr, P., Reid, I., and Fitzgibbon, A. (2009). the min-marginal at a neighboring node can be recom- Global stereo reconstruction under second-order smoothness puted by a single step of dynamic programming. Let the priors. PAMI, 31(12). neighboring node be j = i + 1. The expression for the [37] Zabih, R. and Woodfill, J. (1994). Non-parametric local right min-marginal at j remains correct and the expres- transforms for computing visual correspondence. In ECCV, sion for left min-marginal is updated using its recurrent volume 801. expression ϕij(xj) := min ϕ Appendix A. Details of Dual MM i−1,i(xi) + fi(xi) + fij (xi, xj ), (32) xi In this section we specify details regarding computa- also known as message passing. Then min-marginal at j tion of minorants in Dual MM. The minorants are com- becomes available through (29). puted using message passing and we’ll also need the no- It is possible to perform update (31) in parallel by scal-tion of min-marginals. ing down the step size by the number of variables (or the length of the chain). This is equivalent to decom- A.1. Min-Marginals and Message Passing posing a chain f into n copies with costs f /n so that Definition A.1. For cost vector f its min-marginal at they contribute one for each node i with a min-marginal node i is the function mf : L → mf (x R given by i)/n. Confer to the parallel tree block update algo- rithm of Sontag and Jaakkola [29, Fig. 1]). However, the mf (xi) = min f (x). (26) gain from the palatalization does not pay off the decrease xV\i in the step size. Function mf (xi) is a min projection of f (x) onto xi A.2. Slacks only. Given the choice of xi, it returns the cost of the best labeling in f that passes through x In the following we will also use the term slack. i. For a chain problem it can be computed using dynamic programming. Let us Shortly, it is explained as follows. The dual problem (8) assume that the nodes V are enumerated in the order of can be written as a linear program, see e.g., [35]. Dual in-the chain and E = {(i, i + 1) | i = 1 . . . |V| − 1}. 
We then equality constraints in that program can satisfied as equal- need to compute: left min-marginals: ϕ ities, in which case they are tight, or they can be satisfied i−1,i(xi) := as strict inequalities in which case there is a slack. Equiv- X X min f alent reparametrization of the problem (change of the dual i0 (xi0 ) + fi0j0(xi0, xj0); (27) x1,...i−1 i0i i0j0∈E | i0≥i maximum slack that can be concentrated in a label-node equals the corresponding min-marginal. These values for all ij ∈ E, xi, xj ∈ L can be computed dynamically (recursively). After that, the min-marginal A.3. Good Minoratns mf (xi) expresses as Definition A.2. A modular minorant λ of f is maximal mf (x if there is no other modular minorant λ0 ≥ λ such that i) = fi(xi) + ϕi−1,i(xi) + ϕi+1,i(xi). (29) λ0(x) > λ(x) for some x. TRW-S method [15] can be derived as selecting one Lemma A.3. For a maximal minorant λ of f all min- node i at a time and maximizing (8) with respect to λi marginals of f − λ are identically zero. only. For the two slave problems in (8) TRW-S needs to compute min-marginals mf+λ(xi) and mg−λ(xi). A Proof. Since λ is a minorant, min-marginals mi(xi) = (non-unique) optimal choice for λi would be to ensure that minx [f (x) − λ(x)] are non-negative. Assume for con- V\i tradiction that ∃i, ∃xi such that mi(xi) > 0. Clearly, mf+λ(xi) = mg−λ(xi) ∀xi ∈ L (30) λ0(x) := λ(x) + mi(xi) is also a minorant and λ0 > λ. by setting Even using maximal minorants, the Algorithm 2 can λi := λi + (mg−λ(xi) − mf+λ(xi))/2. (31) get stuck in fixed points which do not satisfy weak tree agreement [15], e.g. suboptimal even in the class of mes-Algorithm 3: Maximal Uniform Minorant sage passing algorithms. Consider the following example Input: Chain subproblem f ; of a minorant leading to a poor fixed point. Output: Minorant λ; Example A.4. Consider a model in Figure 8 with two la- 1 λ := 0; bels and strong Ising interactions ensuring that the optimal 2 while true labeling is uniform. If we select minorants that just takes 3 Compute min-marginals m of f − λ; the unary term, without redistributing it along horizontal 4 if m = 0 then return λ; or vertical chains, the lower bound will not increase. For 5 Let O := [ m = 0]], the support set of optimal example, for the horizontal chain (v solutions of m − λ; 1, v2), the minorant (1, 0) (displayed values correspond to λ 6 Find max{ε | (∀x) εh1 − O, xi ≤ (f − λ)(x)}; v (1) − λv (2)). This minorant is maximal, but it does not propagate the 7 Let λ := λ + ε(1 − O); information available in v1 to v2 for the exchange with the vertical chain (v2, v4). The optimization problem in Line 6 can be solved us- ing the minimum ratio cycle algorithm of Lawler [17]. We +1 search for a path with a minimum ratio of the cost given by (f − λ)(x) to the number of selected labels with non- zero min-marginals given by h1 − O, xi. This algorithm is rather efficient, however Algorithm 3 it is still too costly and not well-suited for a parallel implementation. We will +0.5 not use this method in practice directly, rather it estab- lishes a sound baseline that can be compared to. The resulting minorant λ is maximal and uniform in Figure 8: Example minorize-minimize stucks with a minorant the following sense. that does not redistribute slack. Lemma A.6. Let m be the vector of min-marginals of f . The uniform minorant λ found by Algorithm 3 satisfies A.3.1 Uniform Minorants λ ≥ m/n, (34) Dual algorithms, by dividing the slacks between subprob- where n is the length of the longest chain in f . 
lems ensure that there is always a non-zero fraction of it (depending on the choice of weights in the scheme) prop- Proof. This is ensured by Algorithm 3 as in each step the agated along each chain. We need a minorant, which will increment ε results from dividing the min-marginal by expose in every variable what is the preferable solution h1 − O, xi which is at most the length of the chain. for the subproblem. We can even try to treat all variables In fact, when the chain is strongly correlated, the mi- uniformly. The practical strategy proposed below is moti- norant will approach m/n and we cannot do better than vated by the following. that. However, if the correlation is not as strong the mi- Proposition A.5. Let f ∗ = min norant becomes tighter, and in the limit of zero pairwise x f (x) and let Ou be the support set of all optimal solutions x∗ interactions there holds λ = m. In a sense the minorant u in u ∈ V . Con- sider the minorant λ given by λ computes “decorrellated” min-marginals. u(xu) = ε(1 − Ou) and maximizing ε: The next example illustrates uniform minorants and steps of the algorithm. max{ε | (∀x) εh1 − O, xi ≤ f(x)}. (33) Example A.7. Consider a chain model with the following The above minorant assigns cost ε to all labels but data unary cost entries (3 labels, 6 nodes): those in the set of optimal solutions. If the optimal so- 0 0 1 0 0 8 lution x∗ is unique, it takes the form λ = ε(1 − x∗). 9 7 0 3 2 8 This minorant corresponds to the direction of the subgra- 7 3 6 9 1 0 dient method and ε determines the step size which ensures The regularization is a Potts model with cost f monotonicity. However it is not maximal. In f − λ there uv (xu, xv ) = 1[[xu 6= xv] . Min-marginals of the still remains a lot of slack that can be useful when ex- problem and iteration of Algorithm 3 are ilustrated in changing to the other problem. It is possible to consider Figure 9. At the first iteration the constructed minorant is f − λ 0 0 0 0 0 1 again. If we have solved (33), it will necessarily 1 1 1 1 1 1 have a larger set of optimal solutions. We can search for 1 1 1 1 1 0 a maximal ε1 that can be subtracted from all non-optimal And the final minorant is: label-nodes in f − λ and so on. The algorithm is specified as Algorithm 3. 0 0 0 0 0 7 Algorithm 4: Iterative Minorant 8 7 1 2 2 7 Input: Chain subproblem f ; 6 4 6 7 1 0 Output: Minorant λ; The minorant follows min-marginals (first plot in Fig- 1 λ := 0; ure 9), because the interaction strength is relatively weak 2 for s = 1 . . . max pass do and min-marginals are nearly independent. If we in- 3 for i = 1 . . . |V | do crease interaction strength to 5, we find the following min- 4 Compute min-marginal mi of f − λ at i marginals and minorant, respectively: dynamically, equations (32) and (29); 0 0 0 0 0 3 5 λi += γsmi; 14 15 8 8 7 8 6 Reverse the chain; 12 13 15 10 1 0 0 0 0 0 0 3 5.5 5.5 3 3 3 3 efficiently alternates between the forward and the back- ward passes. For the last pass coefficient γ 4.75 4.75 4.75 4.75 1 0 s is set to 1 to ensure that the output minorant is maximal. Figure 10 It is seen that in this case min-marginals are correlated and illustrates that this idea can perform well in practice. only a fraction can be drained in parallel. The uniform approach automatically divides the cost equally between strongly correlated labels. 
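As a reference point, the min-marginal computation of Eqs. (27)-(29)/(32) and the Iterative Minorant of Algorithm 4 can be sketched for a single chain as follows. Both helpers assume non-negative costs and a dense L×L pairwise table; the backward messages are recomputed once per pass while the forward message is maintained on the already-updated costs, so each node sees the true min-marginal of the current f − λ. Names and array layout are illustrative assumptions, and the batched GPU organisation used in the paper is not reproduced.

```python
import numpy as np

def chain_min_marginals(unary, pair):
    """Min-marginals m_i(x_i) of a chain energy
         f(x) = sum_i unary[i, x_i] + sum_i pair[x_i, x_{i+1}]
       via left/right messages: m_i = unary_i + phi_left_i + phi_right_i (Eq. (29))."""
    n, L = unary.shape
    phi_l = np.zeros((n, L))
    phi_r = np.zeros((n, L))
    for i in range(1, n):            # forward pass, one DP step per node (Eq. (32))
        phi_l[i] = (phi_l[i - 1][:, None] + unary[i - 1][:, None] + pair).min(axis=0)
    for i in range(n - 2, -1, -1):   # backward pass
        phi_r[i] = (pair + unary[i + 1][None, :] + phi_r[i + 1][None, :]).min(axis=1)
    return unary + phi_l + phi_r

def iterative_minorant(unary, pair, max_pass=3, gamma=0.25):
    """Iterative Minorant (Algorithm 4) for one chain with non-negative costs.
       Each sweep drains a fraction gamma of the current min-marginal of f - lambda
       into the modular minorant lambda; the last pass uses gamma = 1 so that the
       returned minorant is maximal (all remaining min-marginals are zero)."""
    n, L = unary.shape
    lam = np.zeros((n, L))
    u = unary.astype(float).copy()   # node costs of the remainder f - lambda
    P = pair
    for s in range(max_pass):
        g = 1.0 if s == max_pass - 1 else gamma
        phi_r = np.zeros((n, L))     # right messages for the current f - lambda
        for i in range(n - 2, -1, -1):
            phi_r[i] = (P + u[i + 1][None, :] + phi_r[i + 1][None, :]).min(axis=1)
        phi_l = np.zeros(L)          # left message, maintained on updated costs
        for i in range(n):
            m = u[i] + phi_l + phi_r[i]     # min-marginal at node i (Eq. (29))
            lam[i] += g * m
            u[i] -= g * m                   # drain it from the remainder
            if i + 1 < n:                   # propagate the left message (Eq. (32))
                phi_l = (phi_l[:, None] + u[i][:, None] + P).min(axis=0)
        u, lam, P = u[::-1].copy(), lam[::-1].copy(), P.T   # reverse the chain
    return lam[::-1].copy() if max_pass % 2 else lam
```

This is only a readable single-chain reference; the cache-friendly Batch-Iter and hierarchical variants discussed in this appendix reorganise the same message-passing operations for parallel hardware.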
TRW−S 8200 TRW−S primal DMM−uniform Primal DMM−Iterative-s−3−frac−0.25 Primal 8000 0 0 0 0 0 7 DMM−Batch-Iter−3−frac−0.25 Primal (a) 10 8 1 4 3 8 7800 8 5 7 10 1 0 7600 0 0 0 0 0 6 7400 (b) 9 6 0 1 2 7 7200 2 3 4 5 6 7 8 9 10 7 3 6 7 0 0 Figure 10: 0 0 0 0 0 5 Same setting as in Figure 3. The new plots show that Iterative minorants are not as good as uniform but still perform (c) 8 5 0 0 0 5 very well. Parameter max pass = 3 and γs = 0.25 were used. 6 2 5 6 0 0 The Batch Iterative method (Batch-Iter) runs forward-backward iterations in a smaller range, which is more cache-efficient and Figure 9: (a) Min-marginals (normalized by subtracting the is also performing relatively well in this example. value of the minimum) at vertices and arrows allowing to back- track the optimal solution passing through a given vertex. (b), (c) min-marginals of f − λ after one (resp. two) iterations of Al- A.3.3 Hierarchical Minorants gorithm 3 (ε1 = 1 and ε2 = 1). With each iteration the number of vertices having zero min-marginal strictly increases. The idea of hierarchical minorants is as follows. Let f be a one horizontal chain. We can break it into two sub- chains of approximately the same size, sharing a variable A basic performance test of Dual MM with uniform xi in the middle. By introducing a Lagrange multiplier minorants versus TRW-S is shown in Figure 3. It demon- over this variable, we can decouple the two chains. The strates that the Dual MM can be faster, when provided value of the Lagrange multiplier can be chosen such that good minorants. The only problem is that determining the both subchains have exactly the same min-marginals in uniform minorant involves repeatedly solving minimum xi. This makes the split uniform in a certain sense. Pro- ratio path problems, plus there is a numerical instability ceeding so we increase the amount of parallelism and hi- in determining the support set of optimal solutions O. erarchically break the chain down to two-variable pieces, for which the minorant is computed more or less straight- A.3.2 Iterative Minorants forwardly. This is the method used to obtain all visual experiments in the paper. Its more detailed benchmarking A simpler way to construct a maximal minorant would is left for future work. We detail now the simplest case be to iteratively subtract from f a portion of its min- when the chain has length two, i.e., the energy is given by marginals and accumulate them in the minorant, until all f min-marginals of the reminder become zero. Algorithm 4 1(x1) + f12(x1, x2) + f2(x2). The procedure to compute the minorant is as follows: implements this idea. The portion of min-marginals • Compute mf drained from the reminder f − λ to the minorant λ in each 1 (x1) and let λ1 := mf 1 (x1)/2. I.e., we subtract a half of the min-marginal in the first node. iteration is controlled by γs ∈ (0, 1]. Reversing the chain • Recompute the new min-marginal at node 2: update Algorithm 5: Handshake [>>>>>>>>>>>>>>><<<<<<<<<<<<<<<] [.......<<<<<<<][>>>>>>>.......] Input: Energy terms fi, fj, fij, messages ϕi−1,i(xi) [...<<<][>>>...][...<<<][>>>...] and ϕj,j+1(xj) ; [.<][>.][.<][>.][.<][>.][.<][>.] Output: Messages for decorrellated chains: ϕji(xi) [][][][][][][][][][][][][][][][] and ϕij(xj) ; /* Message from j to i */ Figure 11: Messages passed in the construction of the hier- 1 ϕji(xi) := Msgji(fj + ϕj,j+1); archical minorant for a chain of length 32. From top to bot- /* Total min-marginal at i */ tom: level of hierarchical processing. 
Symbols > and < denote 2 m message passing in the respective direction. Brackets [] mark i(xi) := ϕi−1,i(xi) + fi(xi) + ϕji(xi); / the limits of the decorrellated sub-chains at the current level. * Share a half to the right */ Dots denote places where the previously computed messages in 3 ϕij (xj ) := Msgij (mi/2 − ϕji); the needed direction remain valid and need not be recomputed. /* Bounce back what cannot be shared */ Places where the two opposite messages meet correspond to the 4 ϕji(xi) := Msgji(−ϕij ); execution of the Handshake procedure. The lowest level con-5 Procedure Msgij (a) sists of 16 decorrellated chains of length 2 each. Input: Unary cost a ∈ K R ; Output: Message from i to j; A.4. Iteration Complexity 6 return ϕ(xj) := minx a(x i ∈L i) + fij (xi, xj ); The bottleneck in a fast implementation of dual algo- rithms are the memory access operations. This is simply the message ϕ12(x2) := Msg12(f1 − λ1); Reassem- because there is a big cost data volume that needs to be ble mf−λ 2 (x2) = ϕ12(x2) + f2(x2). scanned in each iteration plus messages have to be red and • Take this whole remaining min-marginal to the mi- written in TRW-S as well as in out Algorithm 2 (dual vari-norant: let λ2 := mf−λ 2 (x2). ables λ). We therefore will assess complexity in terms of • Recompute the new min-marginal at node 1: update memory access operations and ignore the slightly higher the message ϕ21(x1) := Msg21(f2 −λ2); It still may arithmetic complexity of our minorants. be non-zero. For example, if the pairwise term of f For TRW-S the accesses per pixel are: is zero we recover the remaining half of the initial • read all incoming messages (4 access); min-marginal at node 1. Let λ1 += mf−λ 1 (x1). • read data term (1 access); Importantly, the computation has been expressed in terms • write out messages in the pass direction (2 accesses). of message passing, and therefore can be implemented The cache can potentially amortize writing messages and as efficiently. The procedure fro the two-node case is reading them back in the next scan line, in which case straightforwardly generalized to longer chains. Let ij be the complexity could be counted as 5 accesses per pixel. an edge in the middle of the chain. We compute left min- However, currently only CPU cache is big enough for this, marginal at i, right min-marginal at j and then apply the while multiprocessors in GPU have relatively small cache Handshake procedure over the edge ij, defined in Algo- divided between many parallel threads. rithm 5. The procedure divides the slack between nodes i For the iterative minorant we have 3 forward-backward and j similarly to how it is described above for the pair. passes reading the data cost, the reverse message and writ- The result of this redistribution is encoded directly in the ing the forward message (3*2*3 accesses), the last itera- messages. The two subchains 1, . . . i and j, . . . |V| are tion writes λ and not the message. Some saving is pos- “decorrellated” by the Handshake and will not talk to sible with a small cache set at a cost of more computa- each other further during the construction of the minorant. tions. Computing the hierarchical minorant as described The left min-marginal for subchain j, . . . |V| at node j + 1 in Figure 11 for a chain of length 2048, assuming that is computed using update (32) and so on until the mid- chunks of size 8 already fit in the fast memory (registers dle of the subchain where a new Handshake is invoked. + shared memory) has the following complexity. 
Read- The minorant is computed at the lowest level of hierarchy ing data costs and writing messages until length 8 totals when the length of the subchain becomes two. The struc- to 2 + log2(2048/8)/2 = 6 accesses. Reading messages ture of the processing is illustrated in Figure 11. It is seen is only required at Handshake points and needs to be that each level after the top one requires to send messages counted only until reaching length 8. Writing λ adds one only for a half of nodes in total. Moreover, there is only more access. These estimates are summarized in Table 2. a logarithmic number of level. It turns out that this pro- cedure is not much more computationally costly than just TRW-S Iterative Naive BCD Hierarchical computing min-marginals. For example, to restore left 7(5) 18(8) 5(4) 7 min-marginal for the subchain j, . . . |V |, in node i + 1 we We conjecture that while iterative minorants may trans- Table 2: Memory accesses per pixel in TRW-S and Dual MM fer only a geometric fraction of min-marginals in some with variants of minorants. Naive BCD here means just comput- ing min-marginals. cases, the hierarchical minorant is only by a constant fac- tor inferior to the uniform one. 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 Touching without vision: terrain perception in sensory deprived environments Vojtěch Šalansk´y∗, Vladim´ır Kubelka∗†, Karel Zimmermann∗, Michal Reinstein∗, Tomáš Svoboda∗† Abstract. In this paper we demonstrate a combined hardware and software solution that enhances sensor suite and perception capabilities of a mobile robot intended for real Urban Search & Rescue missions. A common fail-case, when exploring unknown envi- ronment of a disaster site, is the outage or deteriora- tion of exteroceptive sensory measurements that the Figure 1. From left: UGV robot approaches smoke area; robot heavily relies on—especially for localization Example of visual information that the operator sees and navigation purposes. Deprivation of visual and inside a cloud of smoke: a crop out from the omni- laser modalities caused by dense smoke motivated directional camera (middle) and output of the laser range- us to develop a novel solution comprised of force finder (rainbow-colored point cloud in the right half of the sensor arrays embedded into tracks of our platform. image). Laser beams are randomly reflected by smoke particles. The resulting 3D point cloud is just noise close Furthermore, we also exploit a robotic arm for ac- to the robot. tive perception in cases when the prediction based on force sensors is too uncertain. Beside the integration of hardware, we also propose a framework exploiting project1, which develops novel software and tech- Gaussian processes followed by Gibb’s sampling to nology for human-robot teams in disaster response process raw sensor measurements and provide prob- efforts [1], we have to deal with such problems. abilistic interpretation of the underlying terrain pro- One of the crucial fail-cases is the presence of file. In the final, the profile is perceived by propri- dense smoke that blocks camera view and spoils laser oceptive means only and successfully substitutes for measurements, creating false obstacles in front of the the lack of exteroceptive measurements in the close robot (Fig. 1). Without exteroceptive measurements, vicinity of the robot, when traversing unknown and classical approaches to robot SLAM cannot be used. unseen obstacles. 
We evaluated our solution on real Localization can only be in the dead-reckoning sense world terrains. and the operator of the robot has to rely solely on the maps created up to the point of the sensor out- age. In an industrial environment consisting of many 1. Introduction hazardous areas, driving blind can lead to damage or loss of the robot. Advances in robotic technology allow mobile Therefore, we propose a combined hardware and robots to be deployed in gradually more and more software solution to predict the profile of terrain un- challenging environments. However, real-world con- derneath and in front of the tracked robot. The al- ditions often complicate or even prohibit adoption of gorithm exploits a prototype of a force sensor array classical approaches to localization, mapping, nav- installed inside a track of the robot, a robotic arm igation, or teleoperation. When rescuers operate attached to the robot, proprioceptive measurements a UGV during joint experiments in the TRADR from joints and an inertial measurement unit (IMU), and information learned from a dataset of traversed ∗Authors are with the Faculty of Electrical Engineer- terrains. The prototype of the force sensor (Fig. 2, 3) ing, Czech Technical University in Prague, {salanvoj, kubelvla, reinstein.michal, zimmerk, is suitable for tracked robots and is installed between svobodat}@fel.cvut.cz rubber track and its support, allowing it to serve as †Authors are with the Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague 1http://www.tradr-project.eu a tactile sensor. The arm is used to measure height of tion; they mimic facial whiskers of animals and us- terrain outside the reach of the force sensor as contact ing them as a tactile sensor is a promising way to between the arm end-effector and the terrain. The explore areas, which are prohibitive to standard ex- height of terrain that cannot be measured directly is teroceptive sensors. Work of [5] presents a way to estimated by sampling from a joint probability dis- use array of actively actuated whiskers to discrimi- tribution of terrain heights, conditioned by propri- nate various surface textures. In [6], similar sensor oceptive measurements (geometric configuration of is used for a SLAM task. Two sensing modalities— the robot, torques in joints and attitude of the robot) the whisker sensor array and the wheel odometry are and learned from a training dataset consisting of real- used to build a 2D occupancy map. Robot localiza- world examples of traversed terrains. tion is then performed using particle filter with par- The estimates of terrain profile are used as a par- ticles representing one second long ”whisk periods”. tial substitute for missing laser range-finder data that During these periods, the sensor actively builds lo- would reveal obstacles or serve as an input for adap- cal model of the obstacle it touches. Unfortunately, tive traversability algorithm. design of our platform does not allow using such Our contribution is twofold: we designed a new whiskers due to rotating laser range-finder. force sensor suitable for tracked robots as well as an Relation between shape of terrain that we are in- algorithm that uses proprioceptive and tactile mea- terested in and configuration of the flippers is investi- surements to estimate terrain shape in conditions that gated in [7]. The authors exploit the knowledge about prohibit usage of cameras and laser range-finders. 
robot configuration and torques in joints to define We extended this solution with robotic arm to deal a set of rules for climbing and descending obstacles with special cases when the predictions have too high not observed by exteroceptive sensors. We investi- uncertainty. gated this problem in [8] by introducing the adaptive The rest of the paper is structured as follows: Sec- traversability algorithm based on machine learning. tion II concludes briefly the related work, Section III We collected features from both proprioceptive and describes the hardware solution and Section IV the exteroceptive sensors to learn a policy that ensures actual software. In Section V we present both qual- safe traversal over obstacles by adjusting robot mor- itative and quantitative experimental evaluation and phology. An idea of adding pressure sensors mimick- we conclude our achievements in Section VI. ing properties of human skin to feet of bipedal robots is presented in [9, 10]. These sensors can be used 2. Related work for measuring force distribution between the robotic The problem of terrain characterization primarily foot and ground, or for terrain type classification. In using proprioceptive sensors, but also by sonar/infra- tracked robots, caterpillar tracks can be further used red range-finders and by a microphone is discussed in to explore terrain, authors of [11] propose a novel [2]. The authors exploit neural networks trained for distributed sensor that detects deflection of the track each sensor and demonstrate that they are able to rec- in contact points with terrain. Their sensor is espe- ognize different categories: gravel, grass, sand, pave- cially suitable for chained tracks with rubber shoes. ment and dirt surface. More recent results come from The prototype we present is more suitable for thin legged robotics, in [3], Pitman-Yor process mixture rubber tracks. of Gaussians is used to learn terrain types both in On contrary to the approaches exploiting only supervised and unsupervised manner based on force simple contact sensors, we extend our sensory suite and torque features sensed in legs. In our work, we with a robotic arm for further active perception for focus more on the actual terrain profile prediction, cases if necessary. Related to the active perception, necessary for successful traversal. relevant ideas and techniques come from the field of Lack of sufficient visual information related to haptics. The work of [12] proposes to create mod- danger of collision with obstacles is addressed in els of objects in order to be able to grasp them. The [4]: decision whether it is safe to navigate through idea is to complement visual measurements by tac- vegetation is based on wide-band radar measure- tile ones by strategically touching the object in ar- ments since it is impossible to detect solid obstacle eas with high shape uncertainty. For this purpose behind vegetation from laser range-finder or visual they use Gaussian processes (GP, [13]) to express the data. Artificial whiskers offer an alternative solu- shape of the object. We take a similar approach: we choose parts of terrain to be explored by the robotic arm based on uncertainty of the estimate resulting from the sampling process (Sec. 4.3). Probabilistic approach to express uncertainty in touched points is also described in [14], where only tactile sensors of a robotic hand are used to reconstruct the shape of an unknown object. Active tactile terrain exploration Figure 2. 
Prototype of the flipper force sensor: array of six can also lead to terrain type classification, as works sensing elements (FSR 402) is covered by a stripe of steel, of [15, 16] demonstrate. forming a thin sensor that fits between the rubber track and the plastic track support. The stripe of steel protects 3. Sensors the sensors from the moving rubber track and distributes measured force amongst them. 3.1. Sensors of the TRADR UGV The TRADR UGV platform is equipped with both proprioceptive and exteroceptive sensors. Inertial measurement unit Xsens MTi-G (IMU) provides ba- sic attitude measurements; all joints have angle en- coders installed to reveal current configuration of the robot like flipper angles, and velocity of the caterpil- lar tracks. Electric currents to all motors are mea- sured and translated into torque values. Visual in- Force sensing elements Analog-to-digital converter formation about the environment is acquired by an +5V omni-directional Point Grey Ladybug 3 camera ac- FSR 402 companied by a rotating SICK LMS-151 laser range I2C ADC Pi Plus Raspberry Pi 2B finder that provides depth information. The laser R1 10k range-finder is used to collect data that are processed to serve as ground truth for the terrain reconstruction purposes. ... 3.2. Prototype of force sensor Figure 3. The sensor mounted to the plastic track support To obtain well-defined contact points with the (top). The sensing elements are passive sensors that ex- ground, we decided to take advantage of the flippers hibit decrease in resistance with applied force. For each sensing element, we use a reference resistor to form a volt- that can reach in front of the robot and are designed to age divider; we obtain voltage inversely proportional to operate on dirty surfaces or sharp edges. The original the resistance of the FSR 402 elements (bottom). mechatronics of the robot allows to measure torque in flipper servos and thus detect physical contact be- tween flippers and the environment. To be able to ing force; the force sensitivity range is 0.1 − 10 N. locate the contact point on the flipper exactly, we de- To measure the resistance, we connect them in series signed a thin force sensor between the rubber track with a fixed reference resistor forming a voltage di- and its plastic support (see Fig. 2, 3). Since it is a first vider. We apply 5 V to this divider and measure volt-prototype, we use it only in one flipper and consider age on the reference resistor. We use an analog-to- only symmetrical obstacles or steps. The sensor con- digital converter expansion board for the Raspberry struction is a sandwich of two thin stripes of steel Pi computer to read the six voltages. We calibrate with FSR 402 sensing elements between them which the voltage values for initial bias caused by the sand- allows the rubber track to slide over it while mea- wich construction. suring forces applied onto the track. There are six Figure 4 shows three examples of the sensor read- force sensing elements; the protecting sheet of steel ings. The first case consists of a flipper touching flat distributes the force among them, the sensor is thus floor. Although one would expect to see more or less sensitive along its whole length. 
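A minimal sketch of the readout chain implied by Figure 3 follows: each divider voltage (measured across the reference resistor) is converted to the FSR 402 resistance and then to a rough force estimate. The 5 V supply and the 10 kOhm reference resistor come from the figure; the read_channel function and the force calibration constants are placeholders, since the FSR elements require per-sensor calibration over their 0.1-10 N range.

```python
V_SUPPLY = 5.0      # divider supply voltage (Fig. 3)
R_REF = 10_000.0    # reference resistor R1 = 10 kOhm (Fig. 3)

def fsr_resistance(v_out):
    """Voltage measured across the reference resistor -> FSR 402 resistance,
       from v_out = V_SUPPLY * R_REF / (R_REF + R_fsr)."""
    v_out = max(min(v_out, V_SUPPLY - 1e-3), 1e-3)   # guard against division by zero
    return R_REF * (V_SUPPLY - v_out) / v_out

def fsr_force(resistance, k=60.0, exponent=1.0):
    """Very rough force estimate in newtons: FSR conductance is roughly
       proportional to the applied force over part of the sensing range; k and
       exponent are placeholder calibration constants, not datasheet values."""
    return k * (1.0 / resistance) ** exponent

def read_track_forces(read_channel, bias, n_elements=6):
    """Read the six sensing elements and return bias-corrected force estimates.
       `read_channel(i)` stands in for the ADC read on the Raspberry Pi expansion
       board; `bias` holds the per-element voltage offsets caused by the
       sandwich construction."""
    forces = []
    for i in range(n_elements):
        v = read_channel(i) - bias[i]
        forces.append(fsr_force(fsr_resistance(max(v, 0.0))))
    return forces
```

Thresholding these per-element force values (as in the examples of Figure 4) is what later decides whether a contact point is reliable enough to fix the corresponding terrain bin from kinematics alone.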
equal distribution of the contact force along the flip- The FSR 402 sensing elements are passive sen- per track, the torque generated by the flipper actually sors that exhibit decrease in resistance with increas- lifts the robot slightly and thus, most of the force con- 5 allows the robot to measure the height of terrain in 10N) a chosen point by gradually lowering the arm until ≈ 4 upsurge of actuator currents indicates contact with 3 ground (there are currently no touch sensors) [17]. Accuracy of the measurement is 3 cm (standard de- 2 viation). However, the process of unfolding the arm, 1 planning and execution of the desired motion and fi- nally folding back to home position can easily take Sensor element output (5 units 0 1 2 3 4 5 6 45 s. Therefore, it is practical to use the arm for this Sensor element number purpose only in situations when the gain from the 5 10N) additional information overweights the cost of time ≈ 4 spent to get it. In Section 4.4, we describe criterion for decision to use the arm. 3 2 4. Terrain shape reconstruction 1 Sensor element output (5 units 0 When robot is teleoperated operator’s awareness is 1 2 3 4 5 6 Sensor element number based on camera images and the 3D laser map. In the 5 presence of smoke, both of these modalities are use- 10N) less, see output of the operator console in the pres- ≈ 4 ence of smoke shown in Figure 1. We propose active 3 tactile exploration mode (ATEM), in which flippers and robotic arm autonomously explores the terrain 2 shape in close vicinity of the robot. Estimated ter- 1 rain shape and expected reconstruction accuracy are Sensor element output (5 units eventually displayed to the operator. 0 1 2 3 4 5 6 Sensor element number If ATEM is requested by the operator, robot first Figure 4. Examples of the force sensor readings. The plots adjusts flippers to press against the terrain and cap- on the left side show raw readings of each sensing ele- ture proprioceptive measurements. Then the initial ment, only corrected for bias. The photos on the right side probabilistic reconstruction of the underlying terrain document the moments of the readings acquisition. See shape is estimated from the captured data. If the re- section 3 for discussion over the three example cases. construction is ambiguous, the robotic arm explores the terrain height in the most inaccurate place. Even- centrates at its tip (element n. 6). Compare this case tually, the probabilistic reconstruction is repeated. with the third one (bottom), where the pose of the As a result, reconstructed terrain shape with esti- robot prohibits the lifting effect, and we therefore see mated variances is provided. The ATEM procedure the expected result. The second case (middle) shows is summarized in Algorithm 1. The rest of this sec- an example of a touch in one isolated point. tion provides detailed description of particular steps. 3.3. Robotic arm 4.1. Flipper exploration mode The UGV is equipped with a Kinova Jaco robotic arm1, see Fig. 1 left. It is a 6-DOF manipulator (with As soon as the ATEM is requested, the robot halts one extra DOF in each finger) capable of lifting 1.5 driving and adjusts angles of front flippers towards kg. For our approach, it is used for tactile exploration ground until they reach an obstacle or the ground. of surroundings up to cca. 50 cm around the robot. They keep pressing against it by defined torque while For the terrain sensing, robotic arm holds a tool with vector of proprioceptive measurements s is captured. 
a wooden stick—this setup protects its fingers from We measure: i) pitch of the robot (estimated from being broken when pushing against ground. It also IMU sensor), ii) angles of flippers, iii) currents in 1http://www.kinovarobotics.com/service- flipper engines, and iv) 6-dimensional output of the robotics/products/robot-arms force sensor. Variables: h - vector of terrain bin heights, distributions p(hI |hJ\I , s) of all missing heights hI . v - vector of height variances, Missing heights hI are reconstructed as the mean of s - vector of proprioceptive measurements. generated samples, variances v while ATEM is requested do I are estimated as the stop robot; variance of samples. // Invoke flipper exploration mode In the beginning, the missing heights hI are ran- // Section 4.1 domly initialized. The k-th sample hk is obtained while torque in front flippers < threshold do I by iterating over all unknown bins i ∈ I and gener- push flippers down; end ating their heights hk from conditional probabilities i s = capture proprioceptive measurements(); p(hi|hJ\i, s). The conditional probability is mod- // Perform kinematic reconstruction eled by Gaussian process [19, 13, 20] with a squared // Section 4.2 exponential kernel. [h, v] = kinematic reconstruction(s); To train the conditional probabilities, we collected // Perform probabilistic reconstr. real-world trajectories with i) sensor measurements // Section 4.3 [h, v] = probabilistic reconstruction(h, v, s); su and ii) corresponding terrain shapes hu estimated // Invoke arm exploration from the 3D laser map for u = 1 . . . U . The i-th // Section 4.4 conditional probability p(hi|hJ\i, s) is modeled by if any(v > threshold) then one Gaussian process learned from the training set [h, v] = arm exploration(h, v); {[(h1 , s1)>, h1], . . . , [(hU , sU )>, hU ]}. [h, v] = probabilistic reconstruction(h, v, s); J \i i J \i i end Modeling the bin height probabilities as normal move forward; distributions is a requirement laid by the Gaussian end process. However, it allows samples of the bin height that collide with the body of robot, which is of course Algorithm 1: Active tactile exploration mode for physically impossible. We propose to use Gaus- terrain shape reconstruction. sian distribution truncated by known kinematic con- straints, in which are samples constrained by the 4.2. Kinematic reconstruction maximal height that does not collide with the body of the robot. We discuss impact of this modification The terrain shape is modeled by Digital Eleva- in the Section 5. tion Map (DEM), which consists of eleven 0.1 m- wide bins. If there is only one isolated contact point 4.4. Active arm exploration sensed by the force sensor and the force surpasses We use the robotic arm to measure the height of experimentally identified threshold (see Fig. 4, sec- the terrain in bins the flippers cannot reach. The ond case), the height hi of the terrain in the corre- measurement taken by the robotic arm is reasonably sponding bin i is estimated by a geometric construc- accurate and precise but in its current state it takes tion from known robot kinematics, using the attitude about 45s to complete [17]. If the probabilistic recon- of the robot, configuration of joints and the position struction contains bins with variance v higher than of the contact point on the flipper. Variance vi for a user-defined threshold, the robotic arm is used to the corresponding force sensor is set to an experi- measure the height in the most uncertain bin, i.e. the mentally estimated value. 
The remaining hi and vi bin j = arg max values are set to non-numbers. i vi. The height sensed in the given bin is then fixed and the probabilistic reconstruction 4.3. Probabilistic reconstruction process is repeated. In the probabilistic reconstruction procedure, the 5. Experimental evaluation vector of heights h and the vector of variances v are estimated by the Gibbs sampling [18]. Let us de- In qualitative experiments, we focus on typical note the set of all bins J and the set of all bins in cases of terrain profile shapes and discuss perfor- which the reconstruction is needed by I (i.e. those mance of different settings of our algorithm. In quan- which height was not estimated in the kinematic re- titative experiments, we present performance statis- construction procedure or measured by the robotic tics over the whole testing dataset. arm). We use the Gibbs sampling to obtain height The training dataset consists of 28 runs contain- samples hk, k = 1 . . . K from the joint probability ing driving on flat terrain, approaching obstacles of I 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number Figure 5. From left: photo of the robot on a concrete ground; measured forces; terrain reconstruction, the gray polygon indicates position of the robot and its flippers, thin red line is the ground truth—flat ground in this instance. two different heights, traversing them and descend- fourth approach adds direct terrain measurement: we ing from them back to flat ground. Shape of obsta- simulate use of the robotic arm for measurements the cles selected for the dataset reflects the industrial ac- terrain height in bins with high uncertainty [17]. The cident scenario of the TRADR project - the environ- simulation means revealing the value of the bin cap- ment mostly consists of right-angle-shaped concrete tured in the ground truth, variance of the bin is then and steel objects. From the recorded runs, we have equal to the variance of the arm measurements. In extracted approximately 1400 individual terrain pro- the experiments shown in this paper we set the stan- file measurements for training. The whole training dard deviation threshold of Gibbs samples that leads dataset was recorded indoors on flat hard surfaces. to arm exploration to 0.06 m. The fourth approach is The testing dataset was recorded outdoors and com- called as PAFAc (pitch + angle of flippers + flipper bines uneven grass, stone and rough concrete sur- force sensor + robotic arm; constrained). faces. It contains more complex obstacles with vari- ous heights (different from those seen in the training 5.1. Qualitative Evaluation dataset). The testing dataset consists of more than In the figures 5, 6 and 7, we present typical ter-300 terrain profiles with the corresponding sensory rain profiles and robot actions: flat ground, two steps data. Ground truth necessary for training and test- with different height, climbing up a step and stepping ing was created manually by sampling scans from the down of a step. We compare performance of two al- laser range-finder recorded during the experiments. gorithms: i) PAc uses the kinematic constraints when We compare four different algorithms for terrain sampling but does not use the force sensors (light profile prediction. The baseline approach [8] uses blue line in the plots) ii) PAFc algorithm which uses only the IMU sensor and angles of flippers, we call the force sensors (green line and bars). The last two it PA (pitch + angle of flippers) for short. 
The sec- bars marked yellow in order to emphasize the predic- ond setup uses the same data and adds the probability tions are learnt from training dataset and we do not of terrain height being adapted in the way described have enough information to correct the predictions in Section 4.3. If the sampled height collides with from the sensing by flippers. the robot, the sample is set to the maximal possible We use mean of the (Gibbs) samples as the pre- height that is not in collision. The approach is called dicted value (connected by lines) and 0.1 and 0.9 PAc (pitch + angle of flippers; constrained). The quantiles for displaying dispersion of samples (error- third approach adds the flipper force sensor; mea- bars). The point (0, 0) coincides with the location of sured data are used in two ways. If the force mea- the IMU sensor inside the robot body. The depicted sured by a sensor element exceeds a threshold (ex- sketch of the robot: the pitch is estimated by IMU, perimentally set on 2 units), then the height of the flipper angle is directly measured. When the robot bin is computed from kinematics of the robot (pitch lies on a flat ground, Fig. 5, contact point is sensed and flipper angles and position of the sensor element) by the sixth element. The force measurement reduces and the bin is fixed and excluded from the Gibbs sam- uncertainty mainly in positions 0.3 − 0.7 m. pling step. It should be noted however, that the mea- Climbing up a step cases are depicted in Fig. 6. sured forces are used even if they are not bigger than The higher 0.28 m step obstacle is on top. The fifth the threshold – they are part of the proprioceptive sensor element measures the force that is bigger than data s. The approach is called as PAFc (pitch + angle threshold and the height in the bin 0.4 is fixed and of flippers + flipper force sensor; constrained). The not sampled. Note that algorithm PAc which does 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number Figure 6. Top: 0.28 m step, bottom: 0.2 cm step. Note the reduced uncertainty for the PAFc – green line and errorbars. The top photo of the robot is flipped in order to preserve left-to-right orientation which should ease the visual comparison. 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number Figure 7. Top: climbing up a step; Bottom: stepping down of a step. When stepping down, the robot “hangs” on the rear flippers, not the main flippers. 600 0.15 PAFc 0.14 PA PAc 0.13 500 PAc 0.12 PAFc 0.11 PAFAc 400 0.1 0.09 0.08 300 0.07 Frequency 0.06 200 0.05 0.04 Reconstruction error [m] 0.03 100 0.02 0.01 0 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 −0.3−0.2−0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 error [m] x [m] Figure 8. Quantitative evaluation of reconstruction quality Figure 9. Quantitative evaluation of terrain profile recon- in the places/bins that are under the flipper. struction – for all the DEM bins. Median, 1st quartile and 3rd quartile of errors are shown not use force sensor cannot predict the exact edge To evaluate our proposed solution experimentally, location. This fact is indicated by big dispersion of we designed and compared four algorithms—four samples in bins 0.3 and 0.4. The second situation possible approaches for proprioceptive terrain shape shown in Fig. 6 is the lower step. 
5.2. Quantitative Evaluation

As our metric of performance is the absolute error of the estimated bin heights, which is non-negative, we prefer to describe its statistical properties by quantile characteristics rather than by means and standard deviations. The statistics are computed from the whole testing dataset, i.e. from more than 300 outdoor terrain profiles.

First, we measure the direct effect of the force measurement on the accuracy of the height estimates. The graph in Fig. 8 shows the height error frequency of the DEM-bins that are underneath the front flipper. Note that the attribute “underneath the front flipper” is not fixed; it depends on the flipper angle. The force sensor indeed improves the accuracy over using the flipper angle only.

The second experiment studies the statistics for all the DEM-bins individually, see Fig. 9. Adding the kinematic constraint c naturally improves the estimates of the bins underneath the robot body (−0.3 . . . 0.2). Using the force sensors (PAFc) improves the height estimates of the DEM-bins underneath the front flipper (0.3 . . . 0.5). The bins in front of the flippers, i.e. 0.6 and 0.7, are directly measurable only by the arm exploration. It is thus obvious that including the measurement by the arm (PAFAc) has the dominant effect.

Figure 8. Quantitative evaluation of the reconstruction quality in the places/bins that are under the flipper.

Figure 9. Quantitative evaluation of the terrain profile reconstruction – for all the DEM bins. Median, 1st quartile and 3rd quartile of errors are shown.

6. Conclusions

In this paper the aim was to demonstrate a combined hardware and software solution that enhances the sensor suite and perception capabilities of our mobile robot intended for real Urban Search & Rescue missions. We focused our efforts on enabling proprioceptive terrain shape prediction for cases when vision and laser measurements are unavailable or deteriorated (such as in the presence of dense smoke). To evaluate our proposed solution experimentally, we designed and compared four algorithms—four possible approaches for proprioceptive terrain shape reconstruction: a simple kinematics-based approach, constrained kinematics, constrained kinematics with force sensors, and constrained kinematics with both force sensors and the robotic arm—the last intended for special cases where the terrain prediction reaches very high uncertainty. From the presented qualitative and quantitative experimental evaluation we can clearly see that enhancing the sensor suite with the force sensor array proves to be superior. The proposed algorithm, which combines Gaussian processes followed by Gibbs sampling, was successfully implemented on-board the robot to process the raw force measurements and perform the actual terrain shape prediction in a probabilistic manner. We certainly do not claim this is the only and best way to perform such terrain prediction, but it definitely serves as a sufficiently robust and accurate proof of concept for the intended deployment. As part of this concept, the integration of the robotic arm for active perception, in cases when the prediction based on the force sensors is too uncertain, proved to be important. For future work, we aim to embed additional force sensor arrays on all four robot flippers and extend the terrain prediction algorithm accordingly.

ACKNOWLEDGMENT

The authors were supported by the Czech Science Foundation GA14-13876S, the EU-FP7-ICT-609763 TRADR project and the Czech Technical University grant SGS15/081/OHK3/1T/13.

References
[1] I. Kruijff-Korbayová, F. Colas, M. Gianni, F. Pirri, J. de Greeff, K. V. Hindriks, M. A. Neerincx, P. Ögren, T. Svoboda, and R. Worst, “TRADR project: Long-term human-robot teaming for robot assisted disaster response,” KI, vol. 29, no. 2, pp. 193–201, 2015.
[2] L. Ojeda, J. Borenstein, G. Witus, and R. Karlsen, “Terrain characterization and classification with a mobile robot,” Journal of Field Robotics, vol. 23, no. 2, pp. 103–122, 2006.
[3] P. Dallaire, K. Walas, P. Giguere, and B. Chaib-draa, “Learning terrain types with the Pitman-Yor process mixtures of Gaussians for a legged robot,” in Intelligent Robots and Systems (IROS), 2015.
[4] J. Ahtiainen, T. Peynot, J. Saarinen, and S. Scheding, “Augmenting traversability maps with ultra-wideband radar to enhance obstacle detection in vegetated environments,” in Intelligent Robots and Systems (IROS), 2013.
[5] J. Sullivan, B. Mitchinson, M. Pearson, M. Evans, N. Lepora, C. Fox, C. Melhuish, and T. Prescott, “Tactile discrimination using active whisker sensors,” IEEE Sensors Journal, vol. 12, no. 2, pp. 350–362, 2012.
[6] M. Pearson, C. Fox, J. Sullivan, T. Prescott, T. Pipe, and B. Mitchinson, “Simultaneous localisation and mapping on a multi-degree of freedom biomimetic whiskered robot,” in Robotics and Automation (ICRA), 2013.
[7] K. Ohno, S. Morimura, S. Tadokoro, E. Koyanagi, and T. Yoshida, “Semi-autonomous control system of rescue crawler robot having flippers for getting over unknown-steps,” in Intelligent Robots and Systems (IROS), 2007.
[8] K. Zimmermann, P. Zuzanek, M. Reinstein, T. Petricek, and V. Hlavac, “Adaptive traversability of partially occluded obstacles,” in Robotics and Automation (ICRA), 2015.
[9] H. Lee, “Development of the robotic touch foot sensor for 2D walking robot, for studying rough terrain locomotion,” Master’s thesis, University of Kansas, Mechanical Engineering, June 2012.
[10] J. Shill, E. Collins, E. Coyle, and J. Clark, “Terrain identification on a one-legged hopping robot using high-resolution pressure images,” in Robotics and Automation (ICRA), 2014.
[11] D. Inoue, M. Konyo, K. Ohno, and S. Tadokoro, “Contact points detection for tracked mobile robots using inclination of track chains,” in International Conference on Advanced Intelligent Mechatronics, 2008, pp. 194–199.
[12] M. Bjorkman, Y. Bekiroglu, V. Hogman, and D. Kragic, “Enhancing visual perception of shape through tactile glances,” in Intelligent Robots and Systems (IROS), 2013.
[13] C. K. Williams and C. E. Rasmussen, “Gaussian processes for regression,” in Advances in Neural Information Processing Systems 8, D. Touretzky, M. Mozer, and M. Hasselmo, Eds. The MIT Press, 1996, pp. 514–520.
[14] M. Meier, M. Schöpfer, R. Haschke, and H. Ritter, “A probabilistic approach to tactile shape reconstruction,” Robotics, IEEE Transactions on, vol. 27, no. 3, pp. 630–635, 2011.
[15] J. Romano and K. Kuchenbecker, “Methods for robotic tool-mediated haptic surface recognition,” in Haptics Symposium (HAPTICS), 2014 IEEE, 2014, pp. 49–56.
[16] D. Xu, G. Loeb, and J. Fishel, “Tactile identification of objects using Bayesian exploration,” in Robotics and Automation (ICRA), 2013.
[17] V. Šalanský, “Contact terrain exploration for mobile robot,” Master’s thesis, Czech Technical University in Prague, 2015, in Czech.
[18] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 6, pp. 721–741, 1984.
[19] A. O’Hagan and J. Kingman, “Curve fitting and optimal design for prediction,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–42, 1978.
[20] C. E. Rasmussen and H. Nickisch, “GPML Matlab code.”
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Hessian Interest Points on GPU

Jaroslav Sloup, Michal Perd’och, Štěpán Obdržálek, Jiří Matas
Center for Machine Perception, Czech Technical University Prague
sloup|perdom1|xobdrzal|matas @fel.cvut.cz

Abstract. This paper is about interest point detection and GPU programming. We take a popular GPGPU implementation of SIFT – the de-facto standard in fast interest point detectors – SiftGPU, and implement modifications that according to recent research result in better performance in terms of repeatability of the detected points. The interest points found at local extrema of the Difference of Gaussians (DoG) function in the original SIFT are replaced by the local extrema of the determinant of the Hessian matrix of the intensity function.
Experimentally we show that the GPU implementation of the Hessian-based detector (i) surpasses in repeatability the original DoG-based implementation, (ii) gives results very close to those of a reference CPU implementation, and (iii) is significantly faster than the CPU implementation. We show what speedup is achieved for different image sizes and provide an analysis of the computational cost of individual steps of the algorithm.
The source code is publicly available.

1. Introduction

A viewpoint-independent representation of objects in images is one of the fundamental problems in computer vision. A popular approach is to extract a set of local measurements, known as descriptors, at a sparse set of image locations. These locations are called interest points and their purpose is (as opposed to dense image sampling) to reduce the spatial domain of further computation, hence reducing the cost to obtain, and the memory requirements to store, the image representation. It follows that for an interest point extraction process to be practical it needs to repeatedly identify the same points on the object surface when the viewpoint or the environment (e.g. illumination) change. Establishing correspondences between interest points representing an object in multiple images is a building step for a multitude of computer vision tasks, including stereo or multi-view reconstruction, object recognition, and image search and retrieval.

These are the desirable qualities for which interest point detectors are evaluated:
• Transformation Covariance. The detected points should correspondingly ‘follow’ the object as it is depicted from different viewpoints. This paper concerns similarity-covariant detectors which follow 2D image locations (objects at different positions in the image), scales (objects at different distances) and 2D orientations (in-plane rotation of the objects). Affine detectors, which additionally follow out-of-plane 3D rotations, are not considered here.
• Repeatability of detected interest points. The percentage of the points detected at corresponding image locations when the viewpoint changes.
• Accuracy with which the interest points are located and their scales and orientations are estimated.
• Coverage of various visually different classes of objects.
• Robustness under image degradation – noise, motion blur, compression, out of focus images, etc.
• Detection Speed, the computational cost of the interest point detection.
One of the most popular interest point detection algorithms is still the Scale-Invariant Feature Transform (SIFT) proposed by David Lowe [8] in 2004. It consistently ranks high on benchmarks in quality of detected points, but is computationally expensive, and therefore unsuitable e.g. for real-time video processing. Many speedier approximations and alternatives were proposed, e.g. SURF [2], FAST [11] and ORB [12], or CenSurE [1] and SUSurE [3], which can detect interest points significantly faster than SIFT, but often at the expense of repeatability and accuracy.

The only widely used detector that in most tests scores higher in repeatability than SIFT is the so-called Hessian detector. In SIFT, points are identified at local minima or maxima of the Difference of Gaussians function, thence in the presence of blob-like local image structures. In the Hessian detector, the points are located where the determinant of the Hessian matrix (a matrix of second-order partial derivatives) attains local extrema, which occur either for blob-like (local maxima) or for saddle-like structures (local minima). Experiments show that the extrema of the determinant of the Hessian are more repeatable and accurate than the extrema of the Difference of Gaussians, and, thanks to the additional detection of saddle points, the object coverage is generally also improved. The detection speed of the Hessian is similar to that of the SIFT.

Taking advantage of the recent widespread availability of programmable graphics cards, the execution time of many computer vision algorithms benefits if they are reimplemented for GPUs. Interest point detectors are no exception; a GPGPU (general-purpose GPU) SIFT implementation is available from [15, 14, 4]. The SIFTs are detected in real time for moderately sized videos or images on a consumer-grade GPU, therefore there is now a large group of applications for which it is no longer necessary to sacrifice detection quality for execution speed.

We build upon the available GPU SIFT implementation [15] and extend it with several contributions. The Difference of Gaussians is replaced with the determinant of the Hessian matrix as the function whose extrema indicate the presence of interest points. This improves the repeatability, and coverage, of the detected points, as is experimentally demonstrated below. Selection of the best K points (when ordered by the magnitude of the determinant) is implemented in an early stage of the algorithm. If only a specific number of points is requested, it is faster to decide which these are early, on the GPU, before orientations are determined and descriptors computed. Additionally, the feature type (saddle, dark or white blob) is now part of the GPU code output. This is useful in follow-up matching – features of different types should not be considered for a correspondence.

Some of the functionality that was available in the original CPU SIFT implementation and omitted in the GPU version was reintroduced. We add the optional capability to compute orientations and descriptors only in the ⟨0, π⟩ range instead of ⟨0, 2π⟩ by disregarding the sign of the gradients involved, which is beneficial when matching images taken under significantly different illumination (day and night). The restriction that at each image location at most two interest point orientations are detected was lifted. And the maximal number of iterations used for sub-pixel localization of a detected point is now configurable; the original SiftGPU code allowed only a single iteration.

In the rest of the paper we quickly describe the SIFT detector and explain the relations and differences between the Laplacian operator, the Difference of Gaussians and the determinant of the Hessian matrix (Section 2). In Section 3 we sketch the GPU implementation and analyze the computational cost of individual components. Experiments in Section 4 show that the Hessian indeed achieves better performance than the original SIFT and that the GPU and CPU implementations of the Hessian give very similar results.

2. Laplacian of Gaussian, Difference of Gaussians and Determinant of Hessian Matrix
Let us consider a grayscale image to be a discretized form of an underlying real-valued continuous function $f(x, y) : \mathbb{R}^2 \rightarrow \mathbb{R}$. Its Gaussian scale-space representation $L(x, y; t) : \mathbb{R}^3 \rightarrow \mathbb{R}$ is then defined as

$L(x, y; t) = g(x, y, t) \ast f(x, y)$,

where

$g(x, y, t) = \frac{1}{2\pi t} e^{-\frac{x^2 + y^2}{2t}}$

is a rotationally symmetric 2D Gaussian kernel parametrized by the variance $t = \sigma^2$, and where $\ast$ denotes convolution. Partial Gaussian derivatives of the image at a given scale t are then written as

$L_{x^\alpha y^\beta}(\cdot, \cdot; t) = \partial_{x^\alpha y^\beta} L(\cdot, \cdot; t) = (\partial_{x^\alpha y^\beta} g(\cdot, \cdot, t)) \ast f(\cdot, \cdot)$.

The Hessian matrix for a given t is a square matrix of second-order partial derivatives

$H = \begin{pmatrix} \frac{\partial^2 (f \ast g)}{\partial x^2} & \frac{\partial^2 (f \ast g)}{\partial x \, \partial y} \\ \frac{\partial^2 (f \ast g)}{\partial x \, \partial y} & \frac{\partial^2 (f \ast g)}{\partial y^2} \end{pmatrix} = \begin{pmatrix} L_{xx} & L_{xy} \\ L_{xy} & L_{yy} \end{pmatrix}$.

Let $\lambda_1$ and $\lambda_2$ denote the eigenvalues of the Hessian matrix. The Laplacian (or the Laplace operator, the sum of second partial derivatives) of the Gaussian is then $\nabla^2 L = L_{xx} + L_{yy} = \lambda_1 + \lambda_2$.

The Laplacian of Gaussian, appropriately normalized for different scales [7], is the basis for one of the first and also most common detectors of blob-like interest points. Local scale-space extrema are detected that are maxima/minima of $\nabla^2 L$ simultaneously with respect to both space (x, y) and scale t [5]. In the discrete domain, interest points are detected if the value of $\nabla^2 L$ at the point is greater/smaller than all values in its 26-neighbourhood. Locations of such points are covariant with translations, rotations and rescaling in the image domain. If a scale-space maximum is found at a point $(x_0, y_0; t_0)$, then after a rescaling of the image by a scale factor s there will be a corresponding scale-space maximum at $(s x_0, s y_0; s^2 t_0)$ [6].

The Laplacian of the Gaussian operator $\nabla^2 L(x, y; t)$ can be approximated [7] with a difference between two Gaussian-smoothed images at different scales t and $t + \Delta t$:

$\nabla^2 L(x, y; t) \approx \frac{t}{\Delta t} \left( L(x, y; t + \Delta t) - L(x, y; t) \right)$.

This approach is referred to as the Difference of Gaussians (DoG). In a fashion similar to the Laplacian detector, interest points are detected as extrema in the 3D scale-space. The Difference of Gaussians is used in the SIFT algorithm [8].

Another differential interest point detector is derived from the determinant of the Hessian matrix H,

$\det H_L(x, y; t) = L_{xx} L_{yy} - L_{xy}^2 = \lambda_1 \lambda_2$.

At image locations where the determinant is positive the image contains a blob-like structure. The Hessian matrix will there be either positive or negative definite, indicating the presence of either bright or dark blobs. If the determinant of the Hessian matrix is negative, the matrix is indefinite, which indicates a saddle-like interest point [5].

The determinant of the Hessian operator has better scale selection properties under affine image transformations than the Laplacian operator or its Difference-of-Gaussians approximation [7]. It was also shown to perform significantly better for image-based matching using local SIFT-like or SURF-like image descriptors, leading to higher efficiency and precision scores [7]. In an approximation computed from Haar wavelets it is the basis for the interest point detector in SURF [2].
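As a concrete illustration, the following is a minimal Python sketch of the determinant-of-Hessian response at a single scale, using SciPy's Gaussian derivative filters. The $t^2$ normalisation factor and the single-scale formulation are the author's reading of the standard scale-normalised operator; the full detector described in this paper additionally searches for extrema over a whole scale pyramid, which is omitted here.

    # Sketch: scale-normalised det-of-Hessian response for one scale t = sigma^2.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def det_hessian_response(image, sigma):
        f = image.astype(np.float64)
        Lxx = gaussian_filter(f, sigma, order=(0, 2))   # second derivative along x (columns)
        Lyy = gaussian_filter(f, sigma, order=(2, 0))   # second derivative along y (rows)
        Lxy = gaussian_filter(f, sigma, order=(1, 1))   # mixed derivative
        t = sigma ** 2
        return (t ** 2) * (Lxx * Lyy - Lxy ** 2)        # positive: blob, negative: saddle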
3. GPU Implementation and Computation Time Analysis

The GPU interest point implementation proceeds in the steps shown in Figure 1. First, the input image is loaded and transferred to a GPU texture. The scale pyramid data structures, which make up the majority of the GPU memory required, are allocated once at the beginning, and reallocated only in case a bigger image is eventually processed later. The allocation typically takes several hundreds of milliseconds. Initial image upscaling by a factor of two, which is sometimes used in feature detection, is not performed. The scale-space pyramid is then filled – a process that involves smoothing with Gaussian kernels with multiple standard deviations. Keypoints are detected as scale-space extrema of the determinant of the Hessian matrix and their locations are collected into a linear list. Optionally, the points are ordered by the response (the absolute value of the determinant) and only the top K points are kept for further processing. Keypoint orientations are then determined, with approximately 20% of the points ending up with two or more orientations assigned. The points, now with the orientations, are again collected to a list and SIFT descriptors are computed.

Figure 1. Block diagram of the computation pipeline: load image, (re)allocate pyramid, pyramid construction, keypoint detection, linear list of detected points, top K selection, keypoint orientations, multi-orientation linear list, descriptors. CPU code shown in yellow, GPU code in blue.

Figure 2 shows the execution speed measured on three GPU cards. The photo shown on the left, which represents a typical picture used in large-scale image retrieval tasks, was resized to eight different resolutions. Three CUDA-enabled graphics cards were tested: NVidia GeForce GT 730M (384 CUDA cores in 2 streaming multiprocessors, 1024MB DDR3 memory, 64-bit bus) is a representative of a common mobile/laptop GPU. NVidia GTX 750Ti (640 CUDA cores in 5 SMs, 2048MB GDDR5 memory, 128-bit bus) represents a gaming desktop card, and NVidia GTX Titan Black (2880 CUDA cores in 15 SMs, 6144MB GDDR5 memory, 384-bit bus) is a server card. Additionally, execution times of the reference CPU implementation running on a current desktop CPU (i7 4770) are reported.

Figure 2. Detection time for a test image at eight different resolutions. Three GPUs were measured, together with a reference CPU implementation. Total time (excluding image load and pyramid allocation), in ms:

               2592x1944  1920x1440  1600x1200  1280x960  1024x768  800x600  640x480  320x240
    GT 730M       231.13     142.53     108.01     78.91     60.67    46.57    36.06    20.59
    GTX 750Ti      64.08      43.25      35.67     27.57     24.02    19.05    17.15    11.09
    GTX Titan      38.50      29.55      24.81     20.75     18.53    16.67    14.95    10.78
    i7 4770       306.67     174.79     125.12     86.61     61.30    40.94    29.67     9.91
While the mobile GPU is only slightly faster than the CPU, the other two GPU cards are roughly five and eight times faster.

Figure 3 shows a breakdown of the load distribution over individual stages of the keypoint detection process (refer to Fig. 1). The analysis is shown for the desktop (left) and the mobile (right) GPUs. While the desktop card is about five times faster, the proportional distribution of the load is very similar.

Figure 3. Execution time of individual stages of the computation pipeline (refer to Fig. 1), evaluated at several image resolutions, with a default threshold on the detector response. The desktop GPU is about five times faster than the mobile GPU, but the relative load distribution between individual stages is virtually identical. Also the relation of the execution speed and image resolution is similar.

Comparing the execution speed of the original SiftGPU implementation (using the Difference of Gaussians) with our Hessian-based detector, see Fig. 4, we observe that the quality improvement demonstrated below in the Experiments comes at no additional computational cost.

Figure 4. Execution time of the original SiftGPU code, evaluated at several image resolutions. Compare to the timing of our Hessian-based detector on the same hardware (Fig. 3 left). The improved qualitative performance, demonstrated in Section 4, comes with a negligible computational cost.

Finally, in Figure 5 we show the timing when requesting only the best K keypoints. As expected, the stages preceding the top K selection are not affected. The stages following, orientation estimation and computation of the descriptor, take longer for more keypoints, although the increase is sub-linear until the GPU processing power is saturated at around 8000 descriptors computed in parallel.

Figure 5. Execution times when a limited number of K best points is requested. Computed on the full size 2592x1944 image without a threshold on the detector response. As expected, the processing time of the steps preceding the top K selection is not affected, while the later steps, most importantly the computation of the descriptor, scale with the number of points requested.

4. Experiments

The performance of the proposed GPGPU implementation of the determinant-of-Hessian detector was compared with other publicly available detectors, based on the Difference of Gaussians, the multiscale Laplacian and the determinant of the Hessian matrix. In particular, we have evaluated Lowe's [8] original version of SIFT and its VLFeat re-implementation, CPU implementations of the Hessian and the Laplacian, and the original GPU code of SiftGPU. The SURF detector [2], which is based on a fast approximation of the Hessian matrix, is also included. Two sets of experiments are presented: the first one evaluating the transformation invariance of the detectors in terms of repeatability and the number of correspondences, the second one evaluating the performance in a retrieval system.

4.1. Parameter Setting

One of the advantages of the determinant-of-Hessian based detector is in responding to an additional type of local features – saddle points [6]. In our initial experiments on a large set of images, we observed that the number of saddle points in natural images is about the same as the number of bright and dark blobs together. Therefore the Hessian gives roughly twice as many points as the Laplacian/DoG detectors, if detector configurations and thresholds are kept the same. To take advantage of these additional points while keeping the representations comparable in size for the experiments, the detected points in each image were ordered by the absolute response value of the detector and the best 1000, 2000 and 4000 points were selected for evaluation. Finally, to diminish the slight differences in the detection of the dominant orientation, the orientations were fixed to vertical in the retrieval experiment, and were not used in the detector repeatability experiment.
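A minimal sketch of the top-K selection by response magnitude, as used both in the GPU pipeline and in the experimental protocol above, could look as follows in Python. The array and list names are illustrative only; the GPU code performs the equivalent selection on-device.

    # Sketch: keep the K keypoints with the largest absolute detector response.
    import numpy as np

    def top_k_keypoints(keypoints, responses, k):
        responses = np.asarray(responses)
        if len(responses) <= k:
            return list(keypoints)
        idx = np.argpartition(-np.abs(responses), k - 1)[:k]   # O(n) partial selection
        return [keypoints[i] for i in idx]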
4.2. Datasets and Evaluation Protocols

A standard benchmark protocol and dataset for the evaluation of covariant interest point detectors was proposed by Mikolajczyk et al. [9]. It consists of eight sets, each of six images, with an increasing effect of image distortions: camera viewpoint, image scale, isotropic blur, underexposure and image compression. We have selected one scene with each distortion. Ground truth transformations are known, relating reference images of each set to all other images in that set. The transformations are used to compute repeatability scores by considering the overlap error of all pairs of detected points:

$E(R_{E_1}, R_{E_2}) = 1 - \frac{R_{E_1} \cap R_{H_{12}^\top E_2 H_{12}^{-1}}}{R_{E_1} \cup R_{H_{12}^\top E_2 H_{12}^{-1}}}$,

where $R_E$ represents the elliptic region defined by $x^\top E x = 1$ and $H_{12}$ is the known homography between the reference image 1 and the test image 2. To compensate for the different sizes of regions from different detectors, a scale factor is applied such that the region $R_{E_1}$ is transformed to a normalized size (equivalent to a radius of 30 pixels). Before evaluating the overlap error, the region $R_{E_2}$ is scaled using the same factor. The measurement region sizes – the radius of the SIFT index w.r.t. the detected scale of an interest point – were kept at their default values: 6.0 for the DoG detectors (Lowe, VLFeat), 5.2 for the Laplacian and the CPU and GPU Hessian. The reasoning behind this is that the DoG detectors return a slightly smaller (5-10%) intrinsic scale, determined by the smaller of the two subtracted Gaussians.
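For illustration, the overlap error of two elliptic regions can be approximated by rasterising both regions on a common grid, as sketched below in Python. This is a simplified stand-in for the reference protocol: the size normalisation to a 30-pixel radius and the mapping of the second ellipse through the ground-truth homography are assumed to have been applied already, and the grid resolution is an arbitrary choice.

    # Sketch: rasterised overlap error of two ellipses x^T E x <= 1.
    import numpy as np

    def ellipse_mask(E, center, xs, ys):
        dx, dy = xs - center[0], ys - center[1]
        return E[0, 0] * dx**2 + 2 * E[0, 1] * dx * dy + E[1, 1] * dy**2 <= 1.0

    def overlap_error(E1, c1, E2, c2, pad=60.0, step=0.25):
        xs, ys = np.meshgrid(np.arange(c1[0] - pad, c1[0] + pad, step),
                             np.arange(c1[1] - pad, c1[1] + pad, step))
        a = ellipse_mask(np.asarray(E1), c1, xs, ys)
        b = ellipse_mask(np.asarray(E2), c2, xs, ys)
        union = np.logical_or(a, b).sum()
        inter = np.logical_and(a, b).sum()
        return 1.0 - inter / union if union else 1.0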
4.3. Evaluation of Detector Repeatability

Repeatability is one of the important properties of interest point detectors. It is a measure that approximates the probability of the point redetection given the distortion between images. The detectors should be configured to provide comparable numbers of points to make the assessment fair. The repeatability score is complemented with the absolute number of corresponding points detected – the predicted upper bound of the matching problem. Figures 6, 7 and 8 show the measured scores when the number of detected points was limited to 1000, 2000, and 4000, respectively.

We observe that all the three DoG-based detectors (original Lowe's, from VLFeat and SiftGPU) perform virtually the same, as do the two Hessian-based detectors (CPU and GPU implementations). This strongly indicates that the measured performance is indeed inherent to the methods and not to a particular implementation. We also see that the Hessian performs in most cases better than the Laplacian and its DoG approximation. The additionally detected saddle points complement the blobs well and provide valuable correspondences between the images. With the exception of image blur, the fast but approximate SURF performs slightly worse than the other methods.

4.4. Evaluation in Image Retrieval

The repeatability of a detector predicts its pairwise matching potential. To assess the discrimination ability of a coupling of a detector (DoG, Laplacian, or Hessian) with a descriptor (SIFT), a large-scale image retrieval experiment was performed. The Oxford buildings dataset with about 5000 images was used. Each detector was again run in three configurations, requesting at most the 1000, 2000, resp. 4000 best interest points. The SIFT descriptor was computed from a local neighborhood around each point and stored. As there are no significant orientation changes in the dataset, the orientation of the interest points was fixed as vertical in this experiment.

The image retrieval performance was tested using the Oxford buildings dataset and the protocol defined by Philbin et al. [10]. In short, five queries are defined for each of eleven landmarks in Oxford, and a ground truth shortlist of positive examples is given. For each query an average precision (AP) is computed as the area below the precision-recall curve. Finally, a mean AP (mAP) is reported for the whole set of 55 query images.

A standard Bag of Words (BoW) approach with and without Spatial Verification (SV) was used [13, 10]. SIFT descriptors were quantized into three different vocabularies for each detector, with 500k visual words for 1000 points/image, and 1M visual words for 2000 resp. 4000 interest points per image. TF-IDF scoring in an efficient inverted index was used to obtain the BoW ranking. The spatial verification estimated a similarity transformation between the query and each of the top 1000 ranked images. Finally, images were re-ranked based on the number of correspondences. The ranking for each query was evaluated using the Oxford buildings protocol and a mean average precision computed as defined in [10].

The results are summarized in Table 1. The Hessian detectors consistently outperformed both the Laplacian and the Difference of Gaussians, regardless of the size of the representation. Particularly for the highest number of interest points per image (4000), where both DoG implementations were struggling to deliver this many points, their performance dropped. Thus we can conclude that the complementary saddle points detected by the Hessian detector consistently improve the retrieval performance.
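For reference, a simplified average-precision computation is sketched below in Python. It treats AP as the mean of the precision values at each relevant retrieved item; the official Oxford buildings evaluation additionally interpolates the precision-recall curve and distinguishes "junk" images, which is omitted here. The names ranked, positives and ground_truth are illustrative placeholders.

    # Sketch: AP per query and mAP over all queries (simplified protocol).
    import numpy as np

    def average_precision(ranked, positives):
        hits, precisions = 0, []
        for rank, img in enumerate(ranked, start=1):
            if img in positives:
                hits += 1
                precisions.append(hits / rank)       # precision at each recall step
        return float(np.mean(precisions)) if precisions else 0.0

    def mean_average_precision(rankings, ground_truth):
        return float(np.mean([average_precision(r, gt)
                              for r, gt in zip(rankings, ground_truth)]))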
Figure 6. Repeatability score and number of correspondences on image sequences with (from left to right): a significant view angle change, scale change, image blur and exposure change. The number of features per image was limited to the best 1000 according to the absolute response value.
Figure 7. Repeatability score and number of correspondences on image sequences with (from left to right): a significant view angle change, scale change, image blur and exposure change. The number of features per image was limited to the best 2000 according to the absolute response value.
Figure 8. Repeatability score and number of correspondences on image sequences with (from left to right): a significant view angle change, scale change, image blur and exposure change. The number of features per image was limited to the best 4000 according to the absolute response value.

    Method   Max.feat.  Lowe DoG  VLFeat DoG  CPU Laplacian  CPU Hessian  GPU Hessian
    BoW      1000       0.551     0.512       0.572          0.584        0.579
    BoW      2000       0.517     0.547       0.568          0.625        0.629
    BoW      4000       0.558     0.585       0.617          0.643        0.615
    BoW+SV   1000       0.590     0.554       0.601          0.627        0.621
    BoW+SV   2000       0.584     0.594       0.617          0.675        0.678
    BoW+SV   4000       0.639     0.650       0.692          0.716        0.699

Table 1. Image retrieval experiment. The Bag of Words (BoW) method with and without Spatial Verification (SV) was evaluated with different interest point implementations. Features were limited to the best 1000, 2000 resp. 4000 points per image based on the detector's response. The values in the table are the measured mean average precisions, defined in [10].

5. Conclusion

We have implemented an interest point detector based on the determinant of the Hessian matrix. Such a detector was previously shown, and the observation was confirmed in our experiments, to be superior in the quality of detected points to commonly used detectors based on the Difference of Gaussians. Starting with a publicly available GPU implementation of the SIFT detector, we have implemented several modifications and experimentally verified that the performance indeed improved. The implementation, which is in CUDA for compatible NVidia graphics cards, was published and made available.

Acknowledgements

The authors were supported by Toyota Motor Europe.

References

[1] M. Agrawal, K. Konolige, and M. R. Blas. CenSurE: Center surround extremas for realtime feature detection and matching. In D. A. Forsyth, P. H. S. Torr, and A. Zisserman, editors, ECCV (4), volume 5305 of Lecture Notes in Computer Science, pages 102–115. Springer, 2008.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding (CVIU), 110(3):346–359, June 2008.
[3] M. Ebrahimi and W. W. Mayol-Cuevas. SUSurE: Speeded Up Surround Extrema feature detector and descriptor for realtime applications. pages 9–14, Aug. 2009.
[4] H. Fassold and J. Rosner. A real-time GPU implementation of the SIFT algorithm for large-scale video analysis tasks. In IS&T/SPIE Electronic Imaging, pages 940007–940007. International Society for Optics and Photonics, 2015.
[5] T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer, 1994.
[6] T. Lindeberg. Feature detection with automatic scale selection. IJCV, 30(2):79–116, 1998.
[7] T. Lindeberg. Image matching using generalized scale-space interest points. Journal of Mathematical Imaging and Vision, 52(1):3–36, 2015.
[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision, 20(2):91–110, 2004.
[9] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. IJCV, 65(1-2):43–72, 2005.
[10] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[11] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In A. Leonardis, H. Bischof, and A. Pinz, editors, Computer Vision ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 430–443. Springer Berlin Heidelberg, 2006.
[12] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571, Nov 2011.
[13] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. volume 2, pages 1470–1477, 2003.
[14] M. Soltan Mohammadi and M. Rezaeian. SiftCU: An accelerated CUDA based implementation of SIFT. In Third Symposium on Computer Science and Software Engineering, Sharif University, Tehran, volume 3, 2013.
[15] C. Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/˜ccwu/siftgpu, 2007.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

BaCoN: Building a Classifier from only N Samples

Georg Waltner, Michael Opitz, Horst Bischof
Institute for Computer Graphics and Vision
Graz University of Technology, Austria
{waltner, opitz, bischof}@icg.tugraz.at

Abstract. We propose a model able to learn new object classes with a very limited amount of training samples (i.e. 1 to 5), while requiring near zero runtime cost for learning new object classes. After extracting Convolutional Neural Network (CNN) features, we discriminatively learn embeddings to separate the classes in feature space. The proposed method is especially useful for applications such as dish or logo recognition, where users typically add object classes comprising a wide variety of representations. Another benefit of our method is the low demand for computing power and memory, making it applicable for object classification on embedded devices.
We demonstrate on the Food-101 dataset that even one single training example is sufficient to recognize new object classes and considerably improves results over the probabilistic Nearest Class Means (NCM) formulation.

Table 1. One-shot learning results: The top row shows the used training sample (blue), the other rows are the first 6 results where our proposed method yields different results than the probabilistic NCM version. Green framing indicates our improved NCM version is correct, while red stands for the opposite. From left to right: bibimbap, creme brulee, hot dog, lobster roll sandwich, seaweed salad, spring rolls.

1. Introduction

With recent advances in object recognition [7], off-the-shelf features which are learned from a large number of annotated images have become freely available. As datasets grow, it will become increasingly computationally demanding to extend models built from these features to new object classes. Consider images of different food types - Japanese and European users will have quite different imaginations of an average lunch meal. Adapting a pretrained classifier to recognize meals that have not been seen during the training procedure is desirable. Similar to the cognitive capabilities of humans, who can learn new classes from only very few samples, we aim for a computer vision system where new classes can be added incrementally. In this work, we consider classification methods which can integrate previously unseen classes from a single (one-shot) or a small number (n-shot) of training samples. The main purpose of these methods is to recognize new object classes from a very limited number of training samples. This is especially useful for open-ended recognition scenarios, such as logo detection or food recognition, where the number of object classes steadily grows during the life cycle of an object recognition system. However, integrating new classes into a classifier pretrained on different classes is not straightforward. On the one hand a new class often exhibits large variations, on the other hand the classifier trained on the seen classes may not be capable of generalizing to new classes. Retraining state-of-the-art classifiers such as CNNs every time after such class additions leads to high accuracy, but is computationally inefficient and requires a significant amount of memory. For practical application this is often prohibitive, especially for embedded systems. An ideal system should therefore be able to integrate new object classes from very few seen samples on the fly, without the need for time- and memory-consuming retraining of the classifier and with negligible performance loss.
Attribute- based approaches have been used for animal catego- 2. Related Work rization and recognition [9] or for human-nameable visual attributes [11]. Similar to our idea of learning One approach in related works considering n-shot an optimal embedding, [6] statistically infers a Ma- settings is the use of Bayesian learning, where prob- halanobis distance metric on similar and dissimilar abilistic estimates are used to extend the algorithms feature pairs. In [16], a classifier is trained discrimi-to new classes. For example, [3] fit probabilistic natively for nearest prototype classification. Another density functions as category models and use them distance metric learning approach was presented in as prior knowledge for new classes, while using [15], where the authors propose Large Margin Near- one or more samples for generation of the poste- est Neighbor (LMNN) for classification. The LMNN rior model of the new class. Hierarchical Bayesian classifier learns a Mahalanobis metric, so that same models are used by [12], where super-categories are class samples are contracted and samples from differ- automatically discovered based on available classes ent classes are pushed apart from each other. These and serve as prior information to incorporate new methods are able to generalize to previously unseen classes. In [8], authors investigate one-shot learn- samples, but in contrast to our approach they do not ing of characters using Hierarchical Bayesian Pro- regard insertion of previously unseen object classes. where Nc is the number of samples x for class c, We employ a Nearest Class Mean (NCM) clas- f is a feature extraction function and θ are model- sifier [10, 14] for classification. Object classes in parameters. To predict the object class of a sample NCM classifiers are represented by the mean fea- we seek the minimum distance to all class means by ture vector of the corresponding class samples and computing can be easily extended to new classes by computing arg min kf (x; θ) − µ k , (2) the mean over newly added training samples. Our c 2 c=1,...,C method learns discriminative embeddings to better separate the classes in feature space. Other than the where C = |C| is the total number of classes. probabilistic NCM approach of [10], we use CNN 3.1. Feature Extraction features. We propose the hinge loss for optimization and show that this improves overall accuracy. Addi- Motivated by their recent success in image recog- tionally, we show how to robustify the learned NCM nition tasks, we utilize CNNs for feature extraction. embeddings. In contrast to other approaches for one- Instead of training a deep network from scratch, we shot and n-shot learning, our method does not need take a CNN model trained for the ImageNet Chal- access to the full dataset as we do not employ clas- lenge [7] and fine-tune on our task-specific training sifier retraining, enabling the use of our system for data. This can be seen as domain transfer from one embedded platforms like smartphones, where com- task to another and has proven to be superior to hand- puting power and storage is limited. Furthermore, the crafted features [4]. As the later layers of the net- most one-shot algorithms are Bayesian methods and work correspond to high-level features, we use the use prior knowledge from the training data to gen- last fully connected layer as 4096-dimensional fea- erate posterior probabilities for new classes. 
We do ture representation and normalize each feature vector not model such probabilities, but rely on the learned by dividing by its l2-norm. feature embeddings only. Figure 1 gives an overview 3.2. Embedding of our method: We use l2-normalized CNN features and learn additional layers that embed the features After fine-tuning the CNN, we employ several dis- in an optimal way. After that we add new classes tance metric learning methods to learn a discrimina- d×4096 to evaluate the incremental learning capability of our tive linear embedding matrix W ∈ R , with classifier. Table 1 shows one-shot learning results for d ∈ {1024, 4096}. This embedding projects samples some classes of the Food-101 dataset [2]. from the same object class next to each other in a high dimensional feature space, while simultanously 3. One-Shot and N-Shot Classification pushing samples from different object classes far away from each other. Using the embedding W, the In the n-shot classification setting the classifier ex- class prediction from Equation (2) becomes tends to new classes from a very limited number of samples (i.e. 1 to 5). NCM classifiers store a mean arg min kW · f (x; θ) − W · µ k c , (3) 2 vector for each object class they recognize. This has c=1,...,C the advantage that recognition of new classes can be In this work, we consider optimizing NCM loss func- incorporated by simply computing mean vectors for tions and the LMNN loss with respect to W to learn these classes. Mean vectors can be efficiently com- our embedding. In the remainder of this section, we puted online, eliminating the need of explicitly stor- will formally explain the different methods. ing feature vectors of all training samples that the NCM. As proposed in [10], embeddings for NCM class mean originates from. For one-shot learning, classifiers are usually learned by minimizing the neg- the class mean corresponds to one added class sam- ative log-likelihood. The posterior probability p(c|x) ple, for n-shot learning the mean is calculated from of class c given a sample x is defined as n samples of a new class. More formally, let µc be the mean vector for the c-th class from the set C of e−δ(x,µc;θ)2 p(c|x) = , (4) available classes, defined as PC e−δ(x,µi;θ)2 i=1 N where δ is defined as 1 c X µc = f (xi; θ), (1) Nc δ(x, µ; θ) = kW · f (x; θ) − W · µk . (5) i=1 2 To learn the embedding W, minimize the negative space. Let x be a data sample, y the class label and log-likelihood θ the parameters of the feature extraction function f . We propose to train a NCM layer on top of the CNN N 1 X features with the following NCM loss function L = − ln p(yi|xi) (6) N i=1 X L(x, y; θ) = λ · δ2y + max(0, 1 + δy −δc)2, (11) of sample xi and its corresponding class label yi. In c∈C\{y} subsequent sections we refer to this method as prob- where δ abilistic NCM (N CM y = δ(x, y; θ) and δc = δ(x, c; θ) are dis- P ), since we are optimizing a tance functions as defined in Equation (5) and λ is a negative log-likelihood function. weighting parameter. The first part enforces the sam- LMNN. The loss function of the LMNN embed- ples of one class to be embedded near the class mean ding [15] consists of two terms. One adds a penalty of the data sample, while the second term penalizes if for samples that share the same class label but exceed samples are within the margin of other class means. 
LMNN. The loss function of the LMNN embedding [15] consists of two terms. One adds a penalty for samples that share the same class label but exceed a certain distance (margin), while the other penalizes samples with different class labels that are close in feature space. The loss is calculated on triplets instead of pairs, where a sample is complemented by a sample of the same and a sample of a different class. The set of triplets is given by

$D = \{(i, j, k) : y_i = y_j, \; y_i \neq y_k\}$,   (7)

with $1 \leq i < j < k \leq N$; the LMNN loss function over the triplet set is then defined as

$L(D) = \sum_{(i,j,k) \in D} d_{ij}^2 + l_{ijk}$.   (8)

The distance function d of two samples $x_i$ and $x_j$ is

$d_{ij} = \| W \cdot f(x_i; \theta) - W \cdot f(x_j; \theta) \|_2$   (9)

and the triplet loss function $l_{ijk}$ is defined as

$l_{ijk} = \max(0, 1 + d_{ij} - d_{ik})$.   (10)

This embedding maximizes the distance in feature space between samples of different classes $(x_i, x_k)$, while concentrating samples that belong to the same class $(x_i, x_j)$. Following [13], during training we perform “hard” negative mining of triplets which violate the margin constraint imposed by $l_{ijk}$. Opposite to “soft” negatives, which do not violate the margin or violate it by only a small amount, “hard” negatives impose a high loss and therefore lead to faster training of the model and increased performance.

3.3. Large Margin Nearest Class Mean Classifiers

Inspired by LMNN, we propose a large margin loss function for NCM classifiers. Ideally, samples from the same class are close to their own mean vector and are far away from other mean vectors in feature space. Let x be a data sample, y the class label and $\theta$ the parameters of the feature extraction function f. We propose to train an NCM layer on top of the CNN features with the following NCM loss function

$L(x, y; \theta) = \lambda \cdot \delta_y^2 + \sum_{c \in \mathcal{C} \setminus \{y\}} \max(0, 1 + \delta_y - \delta_c)^2$,   (11)

where $\delta_y = \delta(x, y; \theta)$ and $\delta_c = \delta(x, c; \theta)$ are distance functions as defined in Equation (5) and $\lambda$ is a weighting parameter. The first part enforces the samples of one class to be embedded near the class mean of the data sample, while the second term penalizes samples that fall within the margin of other class means. This large margin version of the NCM classifier will be referred to as $NCM_{LM}$.

3.4. Robust NCM

Due to variations in shape, illumination and appearance, feature vectors from an object class usually exhibit intra-class variance. We model this uncertainty by assuming that a feature vector for a sample x associated with class c is generated by a normal distribution $\mathcal{N}(\mu_c, \sigma_c)$. We incorporate this variation in our model by computing the standard deviation $\sigma_c$ for all classes $c \in \mathcal{C}$ over the training set. During optimization we add random noise to our feature vectors to account for this uncertainty. More formally, the loss function we minimize is

$L(x, y; \theta) = \lambda \cdot \hat{\delta}_y^2 + \sum_{c \in \mathcal{C} \setminus \{y\}} \max(0, 1 + \hat{\delta}_y - \hat{\delta}_c)^2$,   (12)

where

$\hat{\delta}(x, \mu; \theta) = \| W \cdot \hat{f}(x; \theta) - W \cdot \mu \|_2$   (13)

and

$\hat{f} = f(x) + \Sigma_y^{\frac{1}{2}} \epsilon \cdot \gamma$.   (14)

$\Sigma_y$ is the diagonal covariance matrix of class y, $\epsilon \in \mathbb{R}^{4096}$, $\epsilon \sim \mathcal{N}(0, 1)$, is a random vector drawn from a normal distribution, and $\gamma$ is a hyper-parameter which defines the impact of the distortions. In our experiments we fix $\lambda$ to 0.01 and $\gamma$ is set to 0.5. During training we first compute the standard deviation of each feature per object class. We then add the noise to our feature vectors, to make the embedding W more robust against inter-class variations. This robustification is done in real time during training and can be seen as data augmentation, making the impact of outliers on the means smaller. We refer to this method as $NCM_{LM\text{-}R}$.
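The large-margin NCM loss (Eq. (11)) and its robust variant (Eqs. (12)-(14)) can be sketched for a single sample as follows in Python. This is an autograd-free forward evaluation for illustration only; shapes, the random generator and the optional class_std argument are assumptions, and the paper's actual training is performed with SGD in Theano.

    # Sketch of the NCM_LM / NCM_LM-R loss for one sample.
    import numpy as np

    def ncm_lm_loss(W, feat, y, means, class_std=None, lam=0.01, gamma=0.5,
                    rng=np.random.default_rng(0)):
        if class_std is not None:                               # robust version, Eq. (14)
            feat = feat + class_std[y] * rng.standard_normal(feat.shape) * gamma
        z = W @ feat                                            # embedded sample
        d = np.linalg.norm(z - means @ W.T, axis=1)             # delta(x, mu_c) for all classes
        d_y = d[y]
        others = np.delete(d, y)
        hinge = np.maximum(0.0, 1.0 + d_y - others) ** 2        # margin term over c != y
        return lam * d_y ** 2 + hinge.sum()                     # Eq. (11) / Eq. (12)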
4. Experiments

For evaluation of our method we use the publicly available Food-101 dataset [2]. It consists of 101 food classes with 1000 images per class. The images were taken in real-world environments, exhibiting a lot of variation in illumination conditions or food arrangement (see Figure 2 for some examples), and are well suited for the targeted application case where users add data continuously.

Following the protocol in [2], we randomly split the 1000 samples of each class into 750 for training and 250 for testing. For training of the CNN, we then apply an 80%/20% split for training and validation (600 and 150 samples, respectively). This results in a training, validation and test set with 60,600 (60%), 15,150 (15%) and 25,250 (25%) samples, respectively. Further, from the 101 classes we randomly select 50 training classes on which we train our classifiers and 51 classes on which we evaluate the generalization capability of our method to novel classes. For the sake of completeness, we also evaluate the embeddings on the 50 training classes only.

For fine-tuning CaffeNet on the Food-101 dataset, we train our network with Stochastic Gradient Descent (SGD) and momentum. We follow standard fine-tuning protocols [7] and use a low initial learning rate of 0.001 and a momentum of 0.9. We anneal the learning rate by a factor of 10 after every 20,000 iterations. To determine convergence, we measure the accuracy on a validation set after every 500 gradient updates.

We optimize our embeddings with SGD and momentum. For training the embeddings, when not otherwise stated, we fix the weights of the CNN and train just the last embedding layer. This allows us to use large learning rates of 0.25-0.5 with a momentum term of 0.9. Further, we use large minibatch sizes of 1024 and train for about 20 epochs. We exponentially anneal the learning rate at epochs 15 and 18. To determine convergence, we measure the accuracy on our validation set after each training epoch.

In our experiments we use Caffe [5] for fine-tuning, while the evaluations of the embedding methods are implemented in Python utilizing the Theano library [1].

Figure 2. First 8 samples of randomly chosen classes from the Food-101 dataset [2]. From top to bottom: baklava, beef carpaccio, chicken curry, chocolate mousse, fried rice, gnocchi, miso soup, panna cotta, scallops, tacos.

4.1. Experiments with Known Classes

To obtain feature representations and a softmax baseline, we fine-tune the pretrained ImageNet CaffeNet model from [7] on the 50 training classes from the Food-101 dataset as described above. In the following, we compare our methods to the softmax classifier (CNN_softmax) and to a probabilistic NCM (NCM_P) version related to the work of [10]. The first results are obtained by nearest class mean classification, using Euclidean (CNN_euc) and cosine (CNN_cos) distance measures between the class means of all training samples and the test samples. Subsequently, we train our NCM and LMNN embedding layers on top of the fine-tuned network while leaving the net weights fixed (NCM_LM, NCM_LM-R, LMNN). A summary of the results is depicted in Table 2.

Interestingly, the nearest class mean classification performs better than the probabilistic version of NCM, implying that the CNN features already separate the classes well. Our robust NCM version improves results over the probabilistic version by about 2% and is very competitive in comparison to the end-to-end trained softmax classifier of the network. The NCM embedding trained on the hinge loss and the LMNN embedding also reach comparable accuracy.
Method        Emb.   Accuracy
CNN_euc       −      68.60
CNN_cos       −      68.64
NCM_P [10]    1024   67.66
NCM_P [10]    4096   67.75
NCM_LM        1024   69.00
NCM_LM        4096   69.14
NCM_LM-R      1024   69.68
NCM_LM-R      4096   69.61
LMNN          1024   69.20
LMNN          4096   69.11
CNN_softmax   −      70.26

Table 2. Classification results on the 50 classes used for fine-tuning the CNN model for feature extraction. Our proposed robust NCM version reaches almost the same accuracy as the end-to-end trained softmax classifier while improving the results over the standard probabilistic NCM classifier by 2%. Best (bold) and second best (underlined) embeddings are marked.

4.2. Introducing Unseen Classes

To assess how our method generalizes to new classes from only a limited number of samples, we use n random samples from the training set of the remaining 51 classes to compute the mean vectors from the output of the embeddings. The embeddings and the CNN remain fixed and are not retrained; hence the addition of new classes reduces to storing the new class means. We choose n ∈ {1, 5, 10, 20, 50, 100} and report the accuracy on the full Food-101 test set, where every class is represented by 250 samples. Since for small values of n the results might have a large standard deviation, we repeat these experiments 100 times using different training samples to compute the class means that represent the new classes.

Table 3 shows that fine-tuning the network in the training phase (known classes) with metric learning methods generally improves accuracy in the testing phase for smaller values of n. Training the CNN with Caffe on the full dataset of 101 classes converges after approximately 100,000 iterations to 66.63%. We also trained two more standard classifiers on the CNN features, namely an SVM and LDA. It is remarkable that although the SVM has access to the full dataset, its performance compared to our proposed methods is inferior for n ∈ {5, 10, 20}. The same applies to the LDA classifier, where only a large number of new samples achieves a performance improvement compared to our proposed methods.

Method     Emb.   n = 1          n = 5          n = 10         n = 20         n = 50         n = 100
CNN_euc    −      44.15 ± 0.08   49.30 ± 0.31   54.26 ± 0.30   57.66 ± 0.20   60.03 ± 0.15   60.83 ± 0.13
CNN_cos    −      44.92 ± 0.24   49.82 ± 0.34   53.92 ± 0.32   57.26 ± 0.21   59.86 ± 0.16   60.76 ± 0.13
LDA        −      44.51 ± 0.01   44.92 ± 0.12   45.85 ± 0.12   48.44 ± 0.18   57.87 ± 0.17   63.61 ± 0.16
NCM_P      1024   45.55 ± 0.33   50.11 ± 0.32   51.89 ± 0.25   53.08 ± 0.20   54.03 ± 0.13   54.39 ± 0.09
NCM_P      4096   45.62 ± 0.34   50.23 ± 0.33   52.03 ± 0.25   53.23 ± 0.20   54.20 ± 0.14   54.57 ± 0.09
NCM_LM     1024   46.23 ± 0.32   51.46 ± 0.34   53.49 ± 0.26   54.88 ± 0.18   56.02 ± 0.15   56.44 ± 0.11
NCM_LM     4096   45.97 ± 0.28   51.43 ± 0.34   53.51 ± 0.27   54.93 ± 0.19   56.06 ± 0.13   56.50 ± 0.11
NCM_LM-R   1024   46.30 ± 0.32   51.95 ± 0.35   54.15 ± 0.26   55.67 ± 0.21   56.89 ± 0.15   57.37 ± 0.11
NCM_LM-R   4096   46.28 ± 0.32   51.94 ± 0.35   54.13 ± 0.28   55.66 ± 0.21   56.85 ± 0.15   57.31 ± 0.11
LMNN       1024   45.78 ± 0.25   51.60 ± 0.32   53.82 ± 0.28   55.34 ± 0.21   56.57 ± 0.16   57.05 ± 0.12
LMNN       4096   45.29 ± 0.18   51.58 ± 0.33   54.10 ± 0.28   55.81 ± 0.21   57.14 ± 0.14   57.63 ± 0.11
SVM        −      46.52 ± 0.40   50.02 ± 0.35   52.25 ± 0.30   55.08 ± 0.24   59.74 ± 0.19   63.38 ± 0.17

Table 3. Classification accuracy over the full Food-101 test set (250 samples per class) after adding n ∈ {1, 5, 10, 20, 50, 100} training samples for each of the 51 test classes. Accuracy and standard deviation are calculated over 100 runs. The baseline accuracy for end-to-end training of the CNN on all classes with all available data is 66.63%. The best and the second best result in each column are shown in bold and underlined, respectively.
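Adding an unseen class therefore amounts to little more than the following sketch; the function and variable names are illustrative and not taken from the paper:

```python
import numpy as np

def add_class(embedded_means, new_samples, W):
    """Add a new class by storing the embedded mean of its n feature vectors
    (sketch of the procedure in Sec. 4.2; CNN and embedding W stay fixed)."""
    mu = np.mean(new_samples, axis=0)     # Eq. (1) over the n new feature vectors
    embedded_means.append(W @ mu)
    return embedded_means

def classify(feature, embedded_means, W):
    """Nearest class mean prediction in the embedded space (Eq. 3)."""
    e = W @ feature
    return int(np.argmin([np.linalg.norm(e - m) for m in embedded_means]))
```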
5. Conclusion

We introduced embedding methods for one-shot and n-shot object class recognition. Our proposed extensions to NCM classifiers consistently improve the accuracy over the standard NCM training formulation in a scenario where the number of classes to be recognized by the classifier doubles. Our methods perform best for settings where only very few new samples (n ≤ 10) per class are available. The extension of the classifier to new object classes is independent of the old training data and is efficient in terms of computational expense and memory. This is especially useful for recognition systems running on embedded devices, where CPU power and memory are limited.

Acknowledgements

This work was supported by the Austrian Research Promotion Agency (FFG) under the projects MANGO (836488) and DIANGO (840824).

References
[1] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Scientific Computing with Python Conference, June 2010.
[2] L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 – Mining Discriminative Components with Random Forests. In European Conference on Computer Vision, 2014.
[3] L. Fei-Fei, R. Fergus, and P. Perona. One-Shot Learning of Object Categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.
[6] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof. Joint Learning of Discriminative Prototypes and Large Margin Nearest Neighbor Classifiers. In IEEE International Conference on Computer Vision, 2013.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, 2012.
[8] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-Shot Learning by Inverting a Compositional Causal Process. In Advances in Neural Information Processing Systems, 2013.
[9] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to Detect Unseen Object Classes by Between-class Attribute Transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[10] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-Based Image Classification: Generalizing to New Classes at Near-Zero Cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624–2637, 2013.
[11] D. Parikh and K. Grauman. Relative Attributes. In IEEE International Conference on Computer Vision, 2011.
[12] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. One-Shot Learning with a Hierarchical Nonparametric Bayesian Model. In Workshop on Unsupervised and Transfer Learning in conjunction with the International Conference on Machine Learning, 2012.
[13] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[14] A. R. Webb and K. D. Copsey. Statistical Pattern Recognition. Wiley, 3rd edition, 2011.
[15] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Advances in Neural Information Processing Systems, 2005.
[16] P. Wohlhart, M. Köstinger, M. Donoser, P. M. Roth, and H. Bischof. Optimizing 1-Nearest Prototype Classifiers. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Cuneiform Detection in Vectorized Raster Images

Judith Massa¹, Bartosz Bogacz¹, Susanne Krömker² and Hubert Mara¹
Interdisciplinary Center for Scientific Computing (IWR)
¹Forensic Computational Geometry Laboratory (FCGL)
²Visualization and Numerical Geometry (NGG)
Heidelberg University, Germany
{judith.massa|bartosz.bogacz|susanne.kroemker|hubert.mara}@iwr.uni-heidelberg.de

Abstract. Documents written in cuneiform script are one of the largest sources about ancient history. The script is written by imprinting wedges (Latin: cunei) into clay tablets and was used for almost four millennia. This three-dimensional script is typically transcribed by hand with ink on paper. These transcriptions are available in large quantities as raster graphics from online sources like the Cuneiform Digital Library Initiative (CDLI). Within this article we present an approach to extract Scalable Vector Graphics (SVG) in 2D from raster images, as we previously did from 3D models. This enlarges our basis of data sets for tasks like word-spotting. In the first step of vectorizing the raster images we extract smooth outlines and a minimal graph representation of sets of wedges, i.e., the main components of cuneiform characters. Then we discretize these outlines, followed by a Delaunay triangulation to extract skeletons of sets of connected wedges. To separate the sets into single wedges we experimented with different conflict resolution strategies and candidate pruning. A thorough evaluation of our methods and their parameters on real-world data shows that the wedges are extracted with a true positive rate of 0.98. At the same time the false positive rate is 0.2, which calls for a future extension using statistics about geometric configurations of wedge sets.

Figure 1: A tracing of the tablet VAT6546 [23].

1. Introduction

Documents were written in cuneiform script for more than three millennia in the ancient Middle East [26]. Cuneiform characters were typically written on clay tablets by imprinting a rectangular stylus and leaving a wedge-shaped (cuneus in Latin) trace, i.e., triangular markings. As clay was always cheaply and easily available, everybody capable of writing could produce robust documents. Therefore, the content of cuneiform tablets ranges from simple shopping lists to treaties between empires. The number of known tablets is assumed to be in the hundreds of thousands, and it is constantly increasing as new tablets are excavated by archaeologists on a regular basis. By roughly estimating the number of words on those tablets, we can assume that the total amount of text in cuneiform script is comparable to that in Latin or Ancient Greek.

Since 1999, a number of projects have been launched to facilitate the work of Assyriologists. The Digital Hammurabi Project is concerned with the digitization of cuneiform tablets [27]. Achievements of the project include the creation of high-resolution 3D models [17] as well as 3D and 2D visualization techniques for the models.
Similarly, projects in Leuven deal with the efficient production of 3D models of tablets [28] and techniques to visualize the models [13]. The Cuneiform Digital Library Initiative [15] incorporates a number of projects aimed at cataloging cuneiform documents and making them available online as transliteration, tracing and 2D image. In [11], the software framework CuneiformAnalyzer is introduced. It assists researchers in script analysis by detecting and segmenting wedge impressions in 3D models [10]. Furthermore, the program simplifies collation of fragments and reconstruction of tablets with methods from 3D computer graphics. The GigaMesh project contributes visualization methods and the extraction of cuneiform characters from tablets [21, 20].

Our method extracts wedge-shaped impressions from raster images. These images are hand-drawn transcriptions of cuneiform tablets of varying quality and with two different styles of marking wedges. We vectorize images of the transcriptions and match patterns and shapes to detect these constellations of wedges in the vectorized transcriptions.

2. Related Work

In [7] the problem of content-based image retrieval of Scalable Vector Graphics (SVG) documents is tackled. Their approach uses a description language to simplify comparisons between shapes. It represents an object by a basic shape, like a unit circle, and a transformation entailing its scale and translation from the origin. The resulting framework handles composites of simple SVG shapes, but no SVG path elements, which are able to represent arbitrary shapes. The chosen similarity measure is a weighted sum of shape, color, transformation, spatial and position similarity.

In [18] the problem of hierarchically clustering shapes described as vector graphics is addressed. Based on [29], Kuntz uses Fourier descriptors [6] to describe and compare single basic SVG shapes. The descriptors serve as feature vectors which are then clustered using state-of-the-art clustering algorithms. We cannot apply Kuntz' method since it does not deal with shapes that are part of a compound described as one object. Yet, our input consists of SVG path elements that usually describe such compound shapes (Figure 2).

Figure 2: (a) Paths described by four distinct wedge shapes and (b) the path described by the compound shape formed by them.

3. Implementation

Our implementation proceeds in five distinct steps. The first three steps transform raster image transcriptions into a set of skeletonized wedge constellations. The final two steps extract and prune wedge candidates from these constellations.

3.1. Vectorization

For the vectorization step, Selinger's potrace algorithm¹ is used. A directed graph G1 is constructed by traveling along the edges between black and white pixels. Thereby, each vertex v in the graph corresponds to a pixel corner which is adjacent to four pixels in the bitmap image, of which at least one has to be black and one has to be white. An edge (v_i, v_{i+1}) between two vertices is created if the corresponding corners are neighbors in the bitmap image and the edge separates a black and a white pixel. A path p = {v_0, ..., v_n} is then a sequence of vertices where there is an edge between each pair of consecutive vertices v_i and v_{i+1} for i = 0, ..., n − 1. A path is called closed if v_0 = v_n. Whenever a closed path is found, the color of the pixels enclosed by it is inverted. The algorithm is applied recursively to the new image until there are no black pixels left.

For each of the resulting paths a polygon is calculated. To this end another directed graph G2 is constructed, where each edge represents a straight path and the vertex set of G2 is the subset of the vertices of G1 reduced to the endpoints of the straight paths.
A path p = {v_0, ..., v_n} is called straight if for all index triples (i, j, k) with 0 ≤ i < j < k ≤ n there exists a point w on the straight line through v_i and v_k such that d(v_j, w) ≤ 1, where d denotes the Euclidean norm. Furthermore, not all four possible vertex-to-vertex directions v_{i+1} − v_i may occur in the path (Figure 3).

Figure 3: Shows how potrace checks if paths are straight. The dots represent the vertices of the paths and the squares the 1/2-neighborhoods of the vertices. Paths in (a), (b) and (d) are straight; (c) and (e) are not.

Each edge is then assigned a penalty P_{i,j} for using the corresponding straight path in the resulting polygon. The penalty is the product of the Euclidean length and the standard deviation of the vertex distances:

    D_{i,j} = \sum_{k=i}^{j} dist(v_k, \overline{v_i v_j})^2,   (1)

    P_{i,j} = |v_i − v_j| · \sqrt{\frac{1}{j ⊖ i + 1} D_{i,j}},   (2)

with j ⊖ i = j − i if i ≤ j and j ⊖ i = j − i + n if j ≤ i, and dist(a, \overline{cd}) the Euclidean distance of a point to a straight segment. Finding an optimal polygon is then equivalent to finding an optimal cycle in graph G2, with the quality measured by the tuple (k, P), where k is the number of straight paths that make up the cycle and P is the sum of the respective penalties. With that, a polygon with a smaller penalty but more segments is considered worse than a polygon with fewer segments but a higher penalty.

After choosing a polygon, Bézier curves are calculated, which smooths the corners where it seems reasonable. Optionally, consecutive curves are joined if the segments agree in convexity and the total direction change does not exceed 89 degrees.

¹http://potrace.sourceforge.net. Project page of potrace. Last visited on 4/11/15.
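For illustration, the penalty of Equations (1) and (2) for a non-wrapping sub-path might be computed as in the following sketch; it measures distances to the supporting line of the segment v_i v_j, which is an assumption on our part rather than the exact potrace implementation:

```python
import numpy as np

def segment_penalty(path, i, j):
    """Penalty P_ij for approximating path[i..j] by one segment (sketch of Eqs. 1-2,
    assuming i <= j, i.e. no cyclic wrap-around)."""
    vi, vj = np.asarray(path[i], float), np.asarray(path[j], float)
    direction = vj - vi
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    d2 = 0.0
    for k in range(i, j + 1):
        off = np.asarray(path[k], float) - vi
        d2 += (off[0] * unit[1] - off[1] * unit[0]) ** 2   # squared perpendicular distance
    return length * np.sqrt(d2 / (j - i + 1))              # Eq. (2)
```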
3.2. Discretization

We tested minimizing the maximum distance between the polygon line segments and the contour segments, but found that even if a polygon approximates an arbitrary shape well, it is still not assured that the resulting Voronoi skeleton will be a good approximation of the skeleton of the wedge constellation. The key to a good discrete skeleton turned out to be limiting the distance of two sample points along the shape outline. However, calculating the distance along a Bézier curve B(t) is a complex task [12], since the length s of the complete curve is

    s = \int_0^1 \sqrt{B'_x(t)^2 + B'_y(t)^2}\, dt,   (3)

which has no closed-form solution. Yet, we know that the curve length s of a Bézier curve of degree 3 has the sum of the distances of consecutive control points C_i as an upper bound:

    s ≤ \sum_{i=0}^{2} \| C_{i+1} − C_i \|.   (4)

Since smaller distances along the path can only improve the quality of the resulting skeleton, this upper bound is used for discretizing the silhouette.

3.3. Skeletonization

In order to deal with occluding wedge marks in the detection step, shape skeletons are used as intermediate representations. Different definitions of shape skeletons have been stated [2, 22, 25, 19]. Based on [25], we define a shape skeleton as the infinite set of points within the shape boundaries that have more than one closest point on the shape outline. The skeleton can be computed efficiently with time complexity O(n log n) by using Voronoi diagrams [16]. The Voronoi diagram for a set of sites S divides a space into |S| partitions called Voronoi regions. In R², a Voronoi region is the interior of a convex polygon whose boundaries, the Voronoi edges, are equidistant to two of the input sites. As input sites, the polygon vertices obtained in the previous discretization step are used.

The Voronoi diagram (Figure 4b) is computed by solving the dual problem first: the Delaunay triangulation (Figure 4a). Each Voronoi vertex represents the circumcenter of a Delaunay facet and a Voronoi ridge connects two such points of neighboring facets. An implementation of the quickhull algorithm [1] is used to calculate the 2-dimensional Delaunay triangulation from a 3-dimensional convex hull.

The Voronoi ridges with end points outside the original shape boundaries and ridges crossing the contour are removed. The skeleton, represented as an undirected graph (Figure 4c), consists of more and in general shorter segments. These, in turn, are made up of longer segments the more vertices form the approximated polygon. Since the short segments are rarely meaningful, considering their directions, a new skeleton (Figure 4d) is constructed. The computation is done by a graph traversal algorithm: it follows a series of consecutive edges until an end node is incident to more than two edges in the original skeleton.

Figure 4: Visualization of important steps of the skeleton computation and simplification process: (a) the Delaunay triangulation, (b) the Voronoi diagram, (c) the inner elements of the Voronoi diagram and (d) the resulting skeleton.
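A minimal sketch of this construction using SciPy (which also relies on Qhull internally, although not via the 3D-convex-hull route described above) could look as follows; the pruning of ridges that cross the contour is omitted here:

```python
import numpy as np
from scipy.spatial import Voronoi
from matplotlib.path import Path

def voronoi_skeleton(outline):
    """Voronoi-based skeleton of a discretized outline (sketch of Sec. 3.3).

    outline : (n, 2) array of discretization points along the shape outline.
    Returns skeleton edges as pairs of points; ridges with an endpoint outside
    the shape boundary are discarded.
    """
    vor = Voronoi(outline)
    inside = Path(outline)                      # point-in-polygon test on the outline
    edges = []
    for ridge in vor.ridge_vertices:
        if -1 in ridge:                         # unbounded ridge, skip
            continue
        p, q = vor.vertices[ridge[0]], vor.vertices[ridge[1]]
        if inside.contains_point(p) and inside.contains_point(q):
            edges.append((tuple(p), tuple(q)))  # keep inner Voronoi ridges only
    return edges
```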
3.4. Extraction

The basic shape of a wedge impression can be described by a Y- or T-junction. We call this junction the wedge-head and the ridges extending from the junction the wedge-arms.

After having computed a shape skeleton, the detection of the wedge-heads of the impressions can be approached. There are two different ways a wedge impression can be drawn: as contour lines or as shapes filled with ink. For a single wedge impression, the filled shape results in a single closed curve after bitmap tracing, while the shape contour is represented as two closed curves, where only the area between both curves is filled with color. Usually, the representation as unfilled shape contour is intended, but for small wedges the thickness of the pencil used for the original ink tracing sometimes leads to solid shapes. The two representations result in two different skeletons, as shown in Figure 5. These two cases are considered separately and certainty values w_loc are calculated for locations in the skeleton graph that seem likely to contain a wedge-head. In both cases, w_loc ranges from zero to one, with values close to one indicating a high probability of a wedge-head at the considered location. The result of this step is a set of wedge-heads for which the certainty value exceeds the threshold t_loc^contour for contour wedges or t_loc^solid for solid wedges.

Figure 5: Two different ways of representing a wedge impression in ink tracings: (a) as unfilled contour and (b) as filled shape. The result of the bitmap trace is drawn in red, the simplified Voronoi skeleton in blue.

Wedge-Head Detection of Shape Contours. Having a wedge impression represented as a contour, the respective skeleton graph shows a cycle resembling a triangle at the position of the wedge-head (Figure 5a). This fact is used to locate the wedge-head of a wedge contour. A cycle in an undirected graph is an ordered set of vertices

    C = (v_0, v_1, ..., v_n)   (5)

where circularly consecutive vertices are adjacent and no vertex appears twice.

To avoid outliers being taken into consideration as wedge-heads, we only look for short cycles. Two concepts of length are possible: the number of edges forming the cycle,

    l_edge(C) = |C| ≤ t_edge,   (6)

and the accumulated distance l_dist along the cycle path, using P_v as the coordinate of a node v, formally defined by

    l_dist(C) = \sum_{i=0}^{|C|−1} |P_{v_i} − P_{v_j}| ≤ t_dist   (7)

with v_i, v_j ∈ C, j = (i + 1) mod |C|, and thresholds t_edge and t_dist. Using l_edge as the cycle length results in a time complexity of O(|E| · t_edge); using l_dist, the complexity is O(|E| · ⌊ t_dist / \min\{\|P_u − P_v\| : (u, v) ∈ E\} ⌋) in the worst case. These concepts are used next to each other in the algorithm.

The cycle extraction proceeds as follows: A depth-first search tree is built and the back-edges are extracted. For each back-edge (u, v) a depth-limited search for node v is conducted with u as root node; paths from v to u represent cycles when joined with {(u, v)}, except the direct path {(v, u)}. At last, the set of cycles is reduced to contain only unique cycles.
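The two cycle-length measures of Equations (6) and (7) are straightforward to compute; the following sketch uses an illustrative dictionary of node coordinates:

```python
import numpy as np

def edge_length(cycle):
    """l_edge(C): number of edges of the cycle (Eq. 6)."""
    return len(cycle)

def dist_length(cycle, coords):
    """l_dist(C): accumulated Euclidean distance along the cycle path (Eq. 7).

    cycle  : ordered list of vertex ids
    coords : dict vertex id -> (x, y) coordinate P_v
    """
    total = 0.0
    for i in range(len(cycle)):
        j = (i + 1) % len(cycle)               # circular successor
        total += np.linalg.norm(np.asarray(coords[cycle[i]]) -
                                np.asarray(coords[cycle[j]]))
    return total
```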
A set of unique cycles contains no two cycles that are equivalent, i.e., that are induced by the same set of graph edges. This is tested by

    C_1 ≡ C_2 \iff E_{C_1} \setminus E_{C_2} = ∅,   (8)

where E_C denotes the edge set of a cycle C. The depth-limited search stops following a search branch when either l_edge or l_dist exceeds its respective threshold, t_edge or t_dist, or when a target node is discovered. It returns a list of paths from the root to the target node.

Triangle Similarity. Once all unique cycles of a skeleton graph have been extracted, their resemblance to a triangle can be analyzed. This can be achieved by comparing the triangle with the smallest error that can be created from the cycle's vertices with the original cycle (Figure 6). With

    A_err^{C,(i_0,i_1,i_2)} = \sum_{j=0}^{2} A_{(c_{i_j}, c_{i_j+1 \bmod |C|}, ..., c_{i_{j+1}})}   (9)

being the sum of error areas between triangle and polygon, we have

    w_loc^contour(C) = 1 − \min_{i_0,i_1,i_2} \frac{A_err^{C,(i_0,i_1,i_2)}}{A_C + A_err^{C,(i_0,i_1,i_2)}}   (10)

for 0 ≤ i_0 < i_1 < i_2 ≤ |C| − 1, C = (c_0, ..., c_n) and n ≥ 2. The advantage of this similarity measure is that the three vertices of the triangle are also vertices of the skeleton graph, thus providing us with feasible starting points for the wedge-arm tracing. Since the cycles form simple polygons, the enclosed areas can be calculated with the shoelace or surveyor's area formula [4].

Figure 6: The triangle with area A_∆ shows the best triangle for the blue polygon. Since the sum of the error areas A_err = A_1 + A_2 + A_3 is almost equal to the triangle area A_∆, the polygon with area A_∆ + A_err is not one of the polygons chosen for wedge-head positions.

Wedge-Head Detection of Solid Shapes. Solid imprints have their centers at junctions v of a skeleton with a particularly long distance from the shape contour. This distance is approximated by the distance of the coordinate P_v of the vertex v to all sites s ∈ S(v), where S is the site set of the underlying Voronoi diagram and

    S(v) = {s ∈ S | P_v is a vertex of V(s)}   (11)

is the set of sites with P_v being a vertex of their Voronoi region V(s). The equation

    d(P_v, s_1) = d(P_v, s_2)  ∀ s_1, s_2 ∈ S(v)   (12)

always holds by definition of the Voronoi diagram. Therefore, any random s ∈ S(v) can be chosen to get a measure for the distance to the contour and use it as a hint for plausible locations of heads of solid wedges (Figure 7). Percentiles of the site-to-vertex distances

    D = {d(P_v, s) | v ∈ S ∧ s ∈ S(v)}   (13)

are used instead of the minimum and maximum to account for outliers. In order to arrive at a range from zero to one with values close to one for long distances d(P_v, s) and close to zero for short distances, the first percentile is used as minimum d_lower, the 99th percentile of these distances as maximum d_upper, and the position within this range is used as certainty value:

    w_loc^solid(v) = { 0 if d(P_v, s) ≤ d_lower;  1 if d(P_v, s) ≥ d_upper;  d*(P_v, s) otherwise }   (14)

    d*(P_v, s) = \frac{d(P_v, s) − d_lower}{d_upper − d_lower}.   (15)

For a junction where n edges meet, \binom{n}{3} wedge-heads are retrieved.

Figure 7: Possible locations of solid wedges are found by looking at skeleton junctions: if their distance to the closest discretization points is above a threshold, it is likely that the center of a wedge impression is located here. Due to the definition of the Voronoi skeleton, the equation d1 = d2 = d3 holds at every skeleton junction.
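The percentile-based certainty of Equations (13)-(15) reduces to a few lines; the function name and the vectorized form are our own:

```python
import numpy as np

def solid_head_certainty(junction_dists):
    """Certainty w_loc^solid for skeleton junctions (sketch of Eqs. 13-15).

    junction_dists : (M,) distance of each junction vertex to one of its Voronoi sites
    """
    d_lower, d_upper = np.percentile(junction_dists, [1, 99])   # robust range
    w = (junction_dists - d_lower) / (d_upper - d_lower)        # Eq. (15)
    return np.clip(w, 0.0, 1.0)                                 # Eq. (14)
```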
After locating the position of a wedge-head, the extents of the impression are calculated. For a wedge-head, multiple wedges are proposed. Wedge vertices {v_i}_{i=1,2,3} must fulfill two conditions:

1. The line segment between the coordinates of a wedge vertex v_i and the tracing start point S may not intersect the shape boundary.

2. A vertex of a wedge must be located within the infinite area between the lines through the coordinates of S and the wedge-head vertices v_j^H and v_k^H, as shown in Figure 8. The condition can be checked by testing whether

    ∠(\overrightarrow{P_S P_{v_i}}, \vec{v}) ≤ \frac{α}{2}   (16)

with

    α = ∠(\overrightarrow{P_{v_j^H} P_S}, \overrightarrow{P_{v_k^H} P_S})   (17)

and \vec{v} being the angle bisector of \overrightarrow{P_{v_j^H} P_S} and \overrightarrow{P_{v_k^H} P_S}.

Figure 8: The pink area shows the place where wedge vertices may be located given the wedge-head of (a) a contour wedge and (b) a solid wedge. The marked junctions are valid as wedge vertex since the straight line to N does not cross the shape boundary. The dotted line in (b) shows that if we had chosen N for solid wedges as for contour wedges, we would not be able to find the correct wedge vertex.

For contour wedges, the tracing start node for a wedge vertex v_i is the respective wedge-head vertex v_i^H; for a solid wedge, the start node is the wedge center. The reason why for the contour vertex the line checked for the conditions above starts at the head vertex instead of the wedge center is that the center in this case is not part of the shape skeleton and is typically located inside a hole in the shape.

From the respective start node the algorithm follows all paths within the area of valid nodes shown in Figure 8. A path may have sections of a certain number of nodes that are inadmissible as wedge vertices. The algorithm returns multiple arms for one direction, resulting in multiple wedge suggestions for a wedge-head. If n_arm1, n_arm2 and n_arm3 are the numbers of arms returned for the respective wedge-head vertices, the number of wedges is n_arm1 · n_arm2 · n_arm3.

3.5. Wedge Set Reduction

The certainty measures for wedge-head locations w_loc can only be used as first hints to possible locations. After wedge-arm tracing, we still get wedge candidates that can be easily identified as improbable by computing the angles between the arms. We want the arms to be evenly spread, so we punish angles that deviate from 120 degrees. Having α1, α2 and α3 as internal wedge angles, we use

    w_angle(α1, α2, α3) = p(α1) · p(α2) · p(α3)   (18)

with

    p(α) = \frac{120}{120 + |120 − α|}   (19)

as measure for the angle quality of wedges. This measure is used in a preliminary reduction step to eliminate all wedges whose angle quality exceeds a given threshold.

The simplest strategy, using a threshold, takes the remaining wedges and removes those that share heads with other wedges having higher angle quality. All other strategies proceed by iteratively testing wedges against the set of chosen wedges and adding them to the result set if no conflicts arise. Wedges are not added if the number of wedge-head edges that are not used by any other wedge as head or arm edge goes below a threshold. Furthermore, contour wedges may not share a head edge with another wedge.
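The angle quality of Equations (18) and (19) is a simple product of per-angle scores; a minimal sketch (angles in degrees) is:

```python
def angle_quality(alpha1, alpha2, alpha3):
    """Angle quality w_angle of a wedge candidate (Eqs. 18-19); 120 degrees is ideal."""
    def p(a):
        return 120.0 / (120.0 + abs(120.0 - a))
    return p(alpha1) * p(alpha2) * p(alpha3)
```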
Balanced Strategies. For documents where the number of solid wedges is greater than or equal to the number of contour wedges, we implemented balanced strategies. There are six different strategies of this kind: balanced-loc, balanced-angle and balanced-size sort the wedges by w_loc, w_angle and size, respectively. Balanced-sides-loc, balanced-sides-angle and balanced-sides-size sort the wedges first by the number of arms that contain at least one edge that is not already used by a chosen wedge. The second kind of balanced strategies recalculates the number of free arms after each iteration. As measure for the size of a wedge, the average length of the lines from the center to the three edges is taken.

Contour-Fill Strategies. Most documents contain more contour wedges than solid wedges. For these documents, the contour-fill strategies have been implemented. The strategies are contour-fill-loc, contour-fill-angle, contour-fill-size, contour-fill-sides-loc, contour-fill-sides-angle and contour-fill-sides-size. They proceed like their respective balanced counterparts but consider the set of contour wedges first before adding solid wedges to the set of chosen wedges. The candidate set of solid wedges is only calculated after the set of chosen contour wedges is computed, and vertices that are incident to a wedge-head edge are excluded.

4. Results

The algorithm has been tested on 94 tracings from [23] and [14]. The ground truth is determined by manually deciding for each cycle and skeleton junction whether it is a valid position of a wedge-head. Since a tracing rarely contains less than 500 wedge marks, two typical tracings have been chosen for the evaluation. They differ in the representation of fractures, in size and in the percentage of solid wedges. As a result we have 1252 annotated cycles and 3792 annotated junctions serving as ground truth.

For the evaluation of the discrimination capabilities of w_loc and w_angle, the Receiver Operating Characteristic (ROC) [9] is used. The ROC shows the quality of a detector by assigning it a point in the ROC space, with the false positive rate (FPR) as x-coordinate and the true positive rate (TPR) as y-coordinate. Therefore, the point assigned to an optimal detector has the coordinates (0, 1). As measure for the overall performance of a discriminator function, the F-score is given [24].

Figure 9: ROC curves of the measures used for the detector of (a) contour wedges and (b) solid wedges.

Contour Wedges. Candidates for contour wedges are found by searching for cycles in the skeleton graph. The similarity of the cycle to a triangle is then used as quality measure for the detector for early rejection of improbable locations for wedge heads. Figure 9a shows that early rejection is reasonable, since the chosen function for the location proves to be a good estimator. However, the green curve shows that the chosen measure for the angles between the arms of a reconstructed wedge is less optimal as a discriminator.

Figure 10a shows the F-score for the contour wedge detector using thresholding only. It shows high scores of about 0.8 to 0.9 for location quality thresholds of about 0.7 to 0.9. The threshold for the angle quality should not be chosen too high, 0.7 at most. The maximum score of 0.90 is achieved for t_loc^contour = 0.79 and the threshold t_angle = 0.45.

Solid Wedges. Candidates for solid wedges are found by searching for skeleton junctions with great distance to the shape contour. Figure 9b shows that weight is a worse discriminator for solid wedges than for contour wedges. The discrimination quality of the angle quality measure for solid wedges looks very similar to the respective curve for contour wedges.

Figure 10b shows the F-scores for the solid wedge detector. For solid wedges, this detector achieves a score of about 0.62 at maximum. In contrast to the F-scores for the contour wedge detector, this score is quite low. The reason for this is that there are a lot more locations to check, since every skeleton junction is considered. Especially when junctions are located next to each other, false hits occur frequently. The best score is achieved for t_loc^solid = 0.68 and t_angle = 0.56.

Figure 10: F-scores for the detector of (a) contour wedges and (b) solid wedges.
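For reference, the quantities reported above can be derived from the raw detection counts as in the following sketch; the standard F1 definition is assumed here, since the text only cites [24] for the F-score:

```python
def detector_scores(tp, fp, fn):
    """Precision, recall and F-score from detection counts (sketch, F1 assumed)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score
```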
4.1. Wedge Set Reduction

The reduction strategies serve to overcome the shortcomings of pure thresholding. We demonstrate their differences with a tracing of the tablet VAT6546 [23]. It represents fractures with lines and shows 215 contour and 120 solid wedges. Figure 11 shows the F-scores for this case. The best F-score is achieved by the contour-fill-sides-loc method with 86% (Figures 12 and 11).

Figure 13 compares the strategies concerning TPR and FPR. It shows a clear ordering between similar methods that differ merely in the measure used for sorting the wedges.

Figure 11: F-scores of reduction strategies for test case VAT6546. The strategies are sorted in descending order by their performance.

Figure 12: The extracted wedge marks of test case VAT6546.

Figure 13: Receiver Operating Characteristic (ROC) space showing the performance of the wedge set reduction strategies for test case VAT6546 (a) for contour wedges, (b) for solid wedges and (c) for all wedges.

5. Summary and Outlook

In this work we presented an algorithm that uses bitmap tracing and skeletonization as intermediate steps to detect wedge impressions in raster graphics of cuneiform documents. We have shown the weaknesses of the measures used to construct an initial wedge set and have shown how conflict set reduction strategies can be used to improve the results significantly.

This work is part of ongoing research on optical character recognition for cuneiform characters [3] and is used as one of many sources of wedge constellations. The presented method will allow us to perform word spotting on raster image databases such as the CDLI. We will also examine whether statistical approaches as in [5] or [8] can be used to enhance the detection results.

References
[1] C. Barber, D. Dobkin, and H. Huhdanpaa. The Quickhull Algorithm for Convex Hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):469–483, 1996.
[2] H. Blum. A Transformation for Extracting New Descriptors of Shape. In Models for the Perception of Speech and Visual Form, pages 362–380. MIT Press, 1967.
[3] B. Bogacz, J. Massa, and H. Mara. Homogenization of 2D & 3D Document Formats for Cuneiform Script Analysis. In Proc. of the 3rd International Workshop on Historical Document Imaging and Processing (HIP15), 2015.
[4] B. Braden. The Surveyor's Area Formula. The College Mathematics Journal, 17(4):326–337, 1986.
[5] M. Cammarosano, G. Müller, D. Fisseler, and F. Weichert. Schriftmetrologie des Keils: Dreidimensionale Analyse von Keileindrücken und Handschriften. Die Welt des Orients, 44(1):2–36, 2014.
[6] R. Cosgriff. Identification of Shape. ASTIA AD 254 792 820-11, Ohio State University Research Foundation, 1960.
[7] E. Di Sciascio, F. Donini, and M. Mongiello. A Knowledge Based System for Content-based Retrieval of Scalable Vector Graphics Documents. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 1040–1044, 2004.
[8] D. Edzard. Keilschrift. In Ia... – Kizzuwatna, volume 5 of Reallexikon der Assyriologie und vorderasiatischen Archäologie, pages 545–567. de Gruyter, 1980.
[9] T. Fawcett. An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[10] D. Fisseler, F. Weichert, G. Müller, and M. Cammarosano. Towards an Interactive and Automated Script Feature Analysis of 3D Scanned Cuneiform Tablets. In The 4th Conference on Scientific Computing and Cultural Heritage (SCCH), pages 1–10, 2013.
[11] D. Fisseler, F. Weichert, G. Müller, and M. Cammarosano. Extending Philological Research with Methods of 3D Computer Graphics Applied to Analysis of Cultural Heritage. In 12th Eurographics Workshop on Graphics and Cultural Heritage (GCH), pages 165–172, 2014.
[12] J. Gravesen. Adaptive Subdivision and the Length and Energy of Bézier Curves. Computational Geometry, 8(1):13–31, 1997.
[13] H. Hameeuw and G. Willems. New Visualization Techniques for Cuneiform Texts and Sealings. Akkadica, 132(2):163–178, 2011.
[14] S. Jakob. Die mittelassyrischen Texte aus Tell Chuēra in Nordost-Syrien, volume 3 of Ausgrabungen in Tell Chuēra in Nordost-Syrien. Harrassowitz, 2009.
[15] J. Kantel, P. Damerow, S. Köhler, and C. Tsouparopoulou. 3D-Scans von Keilschrifttafeln – ein Werkstattbericht. In 26. DV-Treffen der Max-Planck-Institute, pages 41–62. Gesellschaft für wissenschaftliche Datenverarbeitung, 2010.
[16] D. Kirkpatrick. Efficient Computation of Continuous Skeletons. In Proceedings of the 20th Annual IEEE Symposium on Foundations of Computer Science, pages 18–27, 1979.
[17] S. Kumar, D. Snyder, D. Duncan, J. Cohen, and J. Cooper. Digital Preservation of Ancient Cuneiform Tablets Using 3D-Scanning. In Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling, pages 326–333, 2003.
[18] M. Kuntz. Clustering SVG Shapes. In 8th International Conference on Scalable Vector Graphics, 2010.
[19] D. Lee. Medial Axis Transformation of a Planar Shape. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 4(4):363–369, 1982.
[20] H. Mara and S. Krömker. Vectorization of 3D-Characters by Integral Invariant Filtering of High-Resolution Triangular Meshes. In 12th International Conference on Document Analysis and Recognition (ICDAR), pages 62–66, 2013.
[21] H. Mara, S. Krömker, S. Jakob, and B. Breuckmann. GigaMesh and Gilgamesh – 3D Multiscale Integral Invariant Cuneiform Character Extraction. In Proceedings of the 11th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST), 2010.
[22] U. Montanari. Continuous Skeletons from Digitized Images. Journal of the ACM (JACM), 16(4):534–549, 1969.
[23] O. Neugebauer, editor. Register, Glossar, Nachträge, Tafeln, volume 2 of Mathematische Keilschrift-Texte. Springer, 1935.
[24] D. Powers. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011.
[25] F. Preparata. The Medial Axis of a Simple Polygon. In Mathematical Foundations of Computer Science 1977, pages 443–450. Springer, 1977.
[26] W. von Soden. The Ancient Orient: An Introduction to the Study of the Ancient Near East. Wm. B. Eerdmans Publishing Co., 1994.
[27] L. Watkins and D. Snyder. The Digital Hammurabi Project. In Proceedings of Museums and the Web (MW), 2003.
[28] G. Willems, F. Verbiest, W. Moreau, H. Hameeuw, K. Van Lerberghe, and L. Van Gool. Easy and Cost-Effective Cuneiform Digitizing. In The 6th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST), pages 73–80, 2005.
[29] C. Zahn and R. Roskies. Fourier Descriptors for Plane Closed Curves. IEEE Transactions on Computers (TC), 21(3):269–281, 1972.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

2D tracking of Platynereis dumerilii worms during spawning

Daniel Pucher, Walter G. Kropatsch, Nicole M. Artner
Pattern Recognition and Image Processing (PRIP), Vienna University of Technology, Austria
http://www.prip.tuwien.ac.at

Stephanie Bannister, Kristin Tessmar-Raible
Max F. Perutz Laboratories, University of Vienna, Austria
https://www.mfpl.ac.at/

Abstract. Platynereis dumerilii are marine worms that reproduce by external fertilisation and exhibit particular swimming behaviours during spawning. In this paper we propose a novel worm tracking approach that enables 2D tracking and feature extraction during the spawning process of these worms. The gathered data will be used in the future to characterise and compare male and female spawning behaviours.

Figure 1. Image of a male (red) and female (yellow) worm.

1. Introduction

Platynereis dumerilii are marine polychaete worms (Lophotrochozoa, Annelida, Nereididae), which swim only when sexually mature, in order to reproduce. The timing of reproductive spawning events in this species is synchronized with the moon phase, whereby spawning in nature occurs primarily during new moon. This, together with chemical pheromone signaling, allows mature male and female worms to locate one another and engage in spawning behaviors that constitute the nuptial dance.
See Figure 1 for an image of a male and female worm.

The spawning behaviors of male and female worms are important for successful fertilization of the gametes. The spawning process consists of four general phases: pre-spawning, engaged spawning, gamete release and post-spawning. During pre-spawning, male and female worms typically swim independently of one another, usually with lower speeds, and display a linear body shape. Engaged spawning is initiated when male and female worms come into close contact and sense chemical pheromones secreted into the water by the opposite sex. This is accompanied by a noticeable change in swimming behavior for both sexes: swimming speeds increase (particularly for males), and worms either begin to swim in circles, or swim in tighter circles (particularly for females). Other changes in the plane of swimming are more frequently observed in both sexes during engaged spawning behavior. During gamete release, sperm and eggs are secreted into the water, which, particularly for female worms, results in a dramatic change in body area, length and overall shape. The time individual spawning phases take varies and depends on the worms and their willingness to engage. Some worm pairs are better matches than others, which can result in shorter spawning phases.

Our goal is to analyse these spawning behaviours in a quantitative manner, and to characterise and compare male and female-specific spawning behaviours.

2. Task formulation

The aim is to develop methods that enable the tracking of spawning worms from captured videos and extract features to quantify behaviours. For the tracking, it is important that we distinguish male and female worms in every frame of a captured video, label them and keep track of those labels. This paper focuses on the extraction of features for the analysis of behaviours. The tracking task is simplified by only considering videos with single worms. In order to quantify behaviours, we currently extract the following worm features:

1. Skeleton. The skeleton describes the center line of a worm and is defined by two endpoints and an ordered list of points between them. We use the skeleton to calculate the curvature of a worm and to generate a normalized shape representation.

2. Head position. The head position is an important feature for the calculation of the velocity and the worm trajectory. We define it as an endpoint of the skeleton. The tangent of the skeleton in this endpoint can give us information on the orientation of the worm. To choose the right endpoint, we currently select it at the beginning of a video and keep track of that selection.
3. Velocity. As the swimming speeds increase for both sexes, the velocity is a good indication for the beginning of the engaged spawning.

4. Trajectory of the worm head. The mapping of the swimming trajectories gives us information on the interaction between two worms. Furthermore, for individual worms, the curvature of the trajectory can be compared to the curvature of the worm. A high correlation indicates a circular movement and increases the robustness of the curvature estimation. The trajectory can also give an indication of where we can expect the worm to be in a following frame.

5. Curvature. Measurements of body curvature tell us both about the gross and fine body movements of the worm during the different spawning phases. The gross curvature of the worm's body in general provides information on the directionality of swimming. For example, a mostly straight linear profile would be indicative of linear swimming, while smoothly curved body profiles would indicate circular swimming. Good resolution of finer-scale body curvatures along the length of the worm is also important. For example, a linear profile with several bends could indicate an acceleration of swim speed, or 'wriggling' movements, depending on the amplitude of the curvatures. Such wriggling movements can be seen for males when they are stopping to secrete sperm. Similarly, as gametes are released from the tail, mapping fine-scale curvatures at the tip of the tail could be used to map gamete release events, or characterize sex-specific gamete release behaviours. For example, we have observed fast small tail flicks in males during sperm release, and curling of the tip of the tail in females just prior to egg release. The calculation of the curvature is based on the skeleton of the worm.

6. Normalized shape. To make the comparison of different worms (or of the same worm at different times in a video) easier, we create normalized shape representations. To do this we follow a recent strategy known as co-registration, where shapes are first straightened or flattened to then register different views/deformations of the same normalized shape [1].

7. Length and area. During the gamete release phase the body length and area change, especially for female worms. Therefore, these features are a good indicator for the beginning of this phase.

3. Existing tracking approaches

The tracking of animals and the extraction of features to quantify behaviours is not a new field of application. Caenorhabditis elegans (C. elegans) are roundworms that have been used as model systems in neuroscience for years, and the demand for robust computational methods has led to a number of different tracking systems like Nemo [10], OptoTracker [8] or the tracking system developed by Chatenay and Schafer [2]. These worm trackers are capable of tracking worms and extracting a variety of different features. Unfortunately, they were developed for C. elegans worms, which differ in their appearance as well as their locomotion from Platynereis dumerilii. Furthermore, some of them are only capable of tracking single worms, and others terminate the tracking of animals if they collide and assign new tracks after they separate again. This does not guarantee a continuous trajectory of a single worm for a whole video sequence, which is an important requirement for our behaviour analysis. Other animal tracking projects like AnTracks (www.antracks.org) or "Visual Ants Tracking" by Ying [11] are capable of tracking animals, but do not allow the extraction of features which match our requirements. Therefore, we propose a new system that is capable of tracking Platynereis dumerilii worms and offers feature extraction, including a new method to compute normalized shape forms.

4. Experimental setup

The setup of our worm tracker consists of a light-tight box, a mounted infrared camera and an ordinary PC to capture the videos. The worms are placed inside a spherical bowl we refer to as the arena. Figure 2 shows the arena with two worms.

Figure 2. Image of the arena with two worms taken from a captured video.

The camera takes videos at a size of 1280x960 pixels with 60 frames per second. The infrared camera is important as the spawning in nature occurs at night and we want to reproduce this environment in the lab. The single camera setup has some limitations regarding 3D movements of the worms, as they might conceal parts of their body from the camera's viewpoint, resulting in a flawed representation. Analysis of spawning videos has shown that the worms move horizontally near the water surface. Therefore, we decided to use this single camera setup and neglect the few cases where the gathered data is flawed due to 3D movement. We might, however, change the setup in the future, using three cameras instead of one, to solve the issue with the 3D movement.
5. Segmentation and tracking

Basically, male and female worms can be distinguished by their color and anterior/posterior segment border, which can be seen in Figure 3.

Figure 3. Image of a female (top) and a male (bottom) worm with their segment borders (scale in cm).

The segment border divides a worm into a head and a tail part, and the position of the border is different for male and female worms. Relative to their whole body length, male worms have a longer tail than female worms; therefore, the segment border is closer to the head. Unfortunately, the segment border is not always clearly visible. Figure 4 shows three frames of the same worm in the same video just a few seconds apart. These frames illustrate the problem with the segment border. The worms tend to turn sideways when moving fast, and in such cases the segment border is not visible to the camera. This prevents us from using the segment border as a feature to distinguish male and female worms.

Figure 4. Three different frames of a single worm taken from the same video just a few seconds apart. In the first frame on the left the worm turned sideways, therefore the segment border is not visible.

Furthermore, due to the infrared capture, we do not have color information in the captured videos, and the available grayvalues are not distinctive enough to distinguish male and female worms. Therefore, we choose an approach that does not rely on the shape and color of the worms, but on their continuous motion over time. First, we label the worms at the beginning of a captured video. Then, we calculate the distance between the head positions in consecutive frames and assign the label based on the smaller deviation. This approach already works well for single worms, but it is too simple to track pairs of worms, as they tend to overlap and the distance of head positions alone is not a robust criterion.
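The label-propagation rule just described amounts to the following sketch; the data structures and names are illustrative, not taken from the paper:

```python
import numpy as np

def assign_labels(prev_heads, curr_heads):
    """Propagate worm labels between frames by comparing head positions.

    prev_heads : dict label -> (x, y) head position in the previous frame
    curr_heads : list of (x, y) head positions detected in the current frame
    """
    assignment = {}
    for label, prev in prev_heads.items():
        dists = [np.linalg.norm(np.asarray(prev) - np.asarray(c)) for c in curr_heads]
        assignment[label] = curr_heads[int(np.argmin(dists))]   # smaller deviation wins
    return assignment
```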
In this paper, we focus on the tracking of single worms. We will extend our approach to setups with worm pairs in the future. Although we only track single worms at the moment, it is still possible to analyse separate spawning behaviours in male and female worms, as we add eggs or sperm manually to the arena and the worms react to them. This allows us to analyse isolated spawning behaviours.

To track a single worm we first need to segment it from the background. We do this with a simple background subtraction for every frame of the video. For the subtraction, it is important that there is at least one frame at the beginning of the video with an empty arena, which serves as the background image. As this image serves as the background image for the whole video, it is assumed that the arena does not move during the video.

After the background subtraction the resulting image is converted to a binary image, based on a global threshold. The binary image gives us a collection of regions that correspond to changes in relation to the empty arena. Ideally there is only one region for a single worm. Unfortunately, as the worm produces some noise when moving in the arena (particles or bubbles in the water, reflections on the edge of the arena), we also get some noise in our binary image. Therefore, we only consider regions whose area is above a given threshold as worms. As the regions generated by noise are very small, this approach works very well in our current setup.

6. Feature extraction

Features are extracted for every frame of the captured video and are based on the binary region and/or the skeleton of a worm.

6.1. Skeleton

Given the binary region of the worm, we use morphological thinning to compute the skeleton. In our case this approach is superior to morphological skeletonization with the medial axis transform algorithm, as the latter tends to generate more spurious branches. See Figure 5 for a comparison between the two approaches for a sample worm. The thinning approach also tends to create a smoother skeleton.

Figure 5. Illustration of the worm skeletons (white) computed from the binary segmentation image (outlined by the red line). The left skeleton was computed using morphological thinning, the right one using the skeletonization (MAT) technique.

The skeleton is defined as an 8-connected curve s = ⟨p_1, ..., p_n⟩ where p_i = (x_i, y_i) with i = 1, ..., n. We order the points p_i of the skeleton from head to tail by comparing the endpoints of the skeleton in one frame with the endpoints in the previous frame. The position of the head in the first frame of the video has to be specified by the user.
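The segmentation and thinning pipeline can be sketched with NumPy and scikit-image; the threshold and minimum-area values below are illustrative, not the ones used in the paper:

```python
import numpy as np
from skimage import morphology

def segment_and_skeletonize(frame, background, thresh=30, min_area=200):
    """Background subtraction, global thresholding, noise removal and
    morphological thinning (sketch of Sections 5 and 6.1)."""
    # background subtraction on the grayscale infrared frame
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    binary = diff > thresh                                       # global threshold
    # keep only regions large enough to be a worm (noise removal)
    binary = morphology.remove_small_objects(binary, min_size=min_area)
    # skeleton via morphological thinning (preferred over the medial axis transform)
    skeleton = morphology.thin(binary)
    return binary, skeleton
```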
6.5. Curvature

According to Hermann and Klette [6], the estimation of the curvature along a discrete curve can roughly be divided into three categories: the derivative of the tangent angle, the derivative of the curve and the radius of the osculating circle. We chose a method based on osculating circles as it is fast and the implementation is simple. Gray [5] defines the osculating circle of a curve C at a given point P in the continuous space as the circle that has the same tangent as C at point P as well as the same curvature. We approximate these circles with the circumscribed circles of triangles on the discrete skeleton curve. Casey [3] defined the circumscribed circle as the unique circle that passes through each of the triangle's three vertices.

Given the definition of the skeleton s at the beginning of this section, let k be a neighbourhood size with 1 \le k \le \frac{n}{2} if n is odd and 1 \le k \le \frac{n}{2} - 1 if n is even, where n is the number of points on the skeleton. For each point p_i on s we define a triangle between the three points p_{i-k}, p_i and p_{i+k}. Then the radius of the triangle's circumscribed circle is computed to calculate the curvature at p_i. See Figure 6 for a visualization.

Figure 6. Illustration of the circumscribed circle (blue) for a single point p_i on a skeleton. The circle passes through every vertex of the triangle (red) formed by the points p_{i-k}, p_i and p_{i+k} with k = 10.

The radius of the circumscribed circle is defined as radius = \frac{abc}{4 \cdot area}, where a, b and c correspond to the edge lengths of the triangle and area is the area of the triangle. The area of a triangle is given by area = \left| \frac{1}{2} \cdot determinant \right|, where determinant refers to the determinant of the triangle matrix formed from the three triangle points:

determinant = \begin{vmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{vmatrix}

As the sign of the determinant gives an indication of the orientation of the triangle and therefore an indication of the direction of the curvature, we do not use the absolute value. So we define the area as area = \frac{1}{2} \cdot determinant. With this information the radius is then defined as radius = \frac{abc}{2 \cdot determinant}. The curvature is given by the inverse of the radius, c = \frac{1}{radius}. As we do not take the absolute value of the determinant when calculating the radius, the curvature is a signed value that is positive if the curvature is on the right side and negative if it is on the left side of the skeleton curve. See Figure 7 for a visualization.

Figure 7. Image of a worm with its skeleton (top) and a plot of the estimated curvature of the worm for different k (bottom).

An important factor in the accuracy of this algorithm is the parameter k that defines a neighbourhood around the point of interest on the curve. We tested the accuracy on a discrete circle with a radius of 40 pixels generated with Bresenham's circle algorithm. The results can be seen in Figure 8. The parameter k starts at 0.05 \cdot n, as the error gets too big for smaller values. As k increases, the error gets smaller. The same is true for a constant k but an increasing radius, which corresponds to the multigrid convergence theorem, where we expect the accuracy to increase as the grid resolution (or in our case the circle radius) increases [7, Chapter 10].

Figure 8. Plot of the avg- and max-error for the curvature estimation of a circle with radius 40 and increasing k.

So the accuracy gets better with increasing k. Unfortunately, this accuracy comes at a price, as small curvatures are overlooked if k is too big. Another problem with a fixed neighbourhood k are the points at the beginning and the end of the skeleton curve. For points p_a with a \le k there are no neighbourhood points p_{a-k} defined, as the index would become zero or negative. The same is true for points p_b with b > n - k, where no neighbourhood points p_{b+k} are defined, as the index would get bigger than n. We currently solve this problem by disregarding those points on the curve. In Figure 7 the curvature values therefore always start at index 1 + k and end at index n - k. Another problem is the determination of a good value for the parameter k. In Figure 7 the blue line shows the curvature for k = 17, which equals 0.15 \cdot n and gives the best results on the tested worms.
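The signed circumscribed-circle curvature described above can be sketched as follows; this is an illustrative implementation under the assumption of an ordered (head-to-tail) skeleton, not the authors' code.

import numpy as np

def signed_curvature(points, k):
    # points: (n, 2) array of skeleton points ordered from head to tail.
    # Returns signed curvature for indices k .. n-k-1; boundary points are skipped.
    points = np.asarray(points, dtype=float)
    n = len(points)
    curvatures = []
    for i in range(k, n - k):
        p1, p2, p3 = points[i - k], points[i], points[i + k]
        # Edge lengths of the triangle
        a = np.linalg.norm(p2 - p1)
        b = np.linalg.norm(p3 - p2)
        c = np.linalg.norm(p1 - p3)
        # Determinant of the triangle matrix [[x1 y1 1], [x2 y2 1], [x3 y3 1]]
        det = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p3[0] - p1[0]) * (p2[1] - p1[1])
        if det == 0:
            curvatures.append(0.0)        # collinear points: zero curvature
            continue
        radius = (a * b * c) / (2.0 * det)  # signed radius keeps the orientation
        curvatures.append(1.0 / radius)     # signed curvature c = 1 / radius
    return np.array(curvatures)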
6.6. Normalized shape

We achieve the normalized shape representation of a worm with a backward medial axis transform approach. The starting point is the distance transform of the binary worm image, which labels each pixel with the Euclidean distance to the nearest boundary in the binary image. For every point p_i of the sorted list s of skeleton points, we use the coordinates to look up the distances in the distance transform. Those distances then serve as the radii for the circles. See Figure 9 for a visualization.

Figure 9. Part of the distance transform of a worm with circles drawn for four points on the skeleton.

To get a suitable representation of the worm, the distances between the skeleton points in the video frame need to stay the same on the normalized shape representation. Therefore the Euclidean distance between the points is calculated and taken into account when drawing the circles. Figure 10 shows the results of this method, where in the first visualization only the outlines of a few circles are drawn to show the general idea behind this approach.

Figure 10. Plots of the normalized representation of a worm using only the outlines of 24 circles to visualize the general idea (top) and a complete shape visualization with all 115 filled circles (bottom) for that worm.

6.7. Length

To calculate the length of a worm, we use the geodesic distance of its skeleton plus the radii of the circles at the first and last skeleton point. The circle radii are needed as our skeleton endpoints do not lie at the edge of the worm. The geodesic distance is computed using the Euclidean distance.

6.8. Area

For the area of a worm, we simply calculate the sum of all foreground pixels of the binary image of the worm, which is the zeroth moment.
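A sketch of the length and area features, assuming a binary worm mask and a head-to-tail ordered skeleton; the OpenCV distance transform is used here as one possible way to obtain the endpoint radii described above.

import cv2
import numpy as np

def worm_length_and_area(binary_mask, skeleton_points):
    # binary_mask: uint8 image, worm pixels > 0; skeleton_points: (n, 2) array of (x, y)
    pts = np.asarray(skeleton_points, dtype=float)
    # Geodesic length of the skeleton: sum of Euclidean distances between consecutive points
    geodesic = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    # Distance transform labels each foreground pixel with the Euclidean distance
    # to the nearest boundary; used as circle radii at the skeleton endpoints
    dist = cv2.distanceTransform((binary_mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
    r_head = dist[int(pts[0, 1]), int(pts[0, 0])]    # radius at the first skeleton point
    r_tail = dist[int(pts[-1, 1]), int(pts[-1, 0])]  # radius at the last skeleton point
    length = geodesic + r_head + r_tail
    # Area: number of foreground pixels (zeroth moment of the binary image)
    area = int(np.count_nonzero(binary_mask))
    return length, area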
7. Single worm experiments

Some experiments with single female worm videos were conducted. Figure 12 shows two plots of smoothed worm lengths. For the smoothing, a moving average filter was applied to the original data. The plots show the length of the worms around the time of the gamete release, where the female worms secrete their eggs into the water and get smaller and therefore shorter. This can also be observed in the plots.

Figure 12. Two plots of smoothed worm lengths for two different female worms right around the time of the gamete release (marked in red).

Figure 13 shows how the length of a female worm changes during an entire spawning process. Annotation A marks a special case where the worm is overlapping itself, resulting in a faulty binary area and skeleton. The problem here is the 3D movement of the worm. Another special case where the 3D movement also results in error-prone data is marked with annotation B. Here the end of the tail is not visible to the camera, which makes the worm appear shorter in the video.

Figure 13. Change of worm length over time. During the gamete release the worm gets shorter. Annotation A: Wrong skeleton due to 3D movement of the worm. Annotation B: Wrong length due to 3D movement of the worm.

8. Conclusion and Future work

In this paper we proposed a novel worm tracking approach for Platynereis dumerilii worms that enables both tracking and feature extraction from captured videos. Although our tracking approach is not suitable for tracking two worms in difficult cases, our methods to extract worm features already show promising results.

The method we currently use to track single worms works for two worms if they are physically separated, but as they get close to each other or overlap, the current method might fail. In the future, we will extend the method to consider cases where the worms are close to each other or even overlap. Ideas to achieve this include the comparison of more features than just the head positions of consecutive frames. A combination of all other features could yield an appropriate approach to distinguishing male and female worms.

The current feature extraction is robust in most cases, but there exist special cases where a single worm overlaps itself due to 3D movement in the water. This results in regions and skeletons which do not represent the worm correctly, and therefore the extracted features are flawed as well. One approach will be to look into the watershed method to segment the worms, as it might be superior to the simple threshold based method we use now, especially for worms that overlap.

Our approach to compute the curvature also has some flaws and is not robust enough. In the future we will look into alternative approaches to compute the curvature of discrete curves. Other methods that try to estimate the osculating circles rely on digital straight segment (DSS) recognition [6][4], and Roussillon and Lachaud [9] base their method on maximal digital circular arcs.

9. Acknowledgement

The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC Grant Agreement 337011 to KT-R. We also thank the anonymous reviewers for their valuable input.

References
[1] N. Aigerman, R. Poranne, and Y. Lipman. Lifted bijections for low distortion surface mappings. ACM Trans. Graph., 33(4):69:1-69:12, July 2014. 2
[2] J. B. Arous, Y. Tanizawa, I. Rabinowitch, D. Chatenay, and W. R. Schafer. Automated imaging of neuronal activity in freely behaving caenorhabditis elegans. Journal of Neuroscience Methods, 187(2):229-234, 2010. 3
[3] J. Casey. A sequel to the first six books of the elements of Euclid, containing an easy introduction to modern geometry, with numerous examples. 7th edition, revised and enlarged. Hodges, Dublin. University Press Series, 184 pp., 1895. 5
[4] D. Coeurjolly, S. Miguet, and L. Tougne. Discrete curvature based on osculating circle estimation. In 4th International Workshop on Visual Form 2001, Capri, Italy, Springer Lecture Notes in Computer Science 2059, pages 303-312, 2001. 8
[5] A. Gray. Modern Differential Geometry of Curves and Surfaces with Mathematica. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 1996. 5
[6] S. Hermann and R. Klette. A comparative study on 2d curvature estimators. In Computing: Theory and Applications, 2007. ICCTA '07. International Conference on, pages 584-589, March 2007. 5, 8
[7] R. Klette and A. Rosenfeld. Digital Geometry. Morgan Kaufmann Publishers, 2004. 6
[8] D. Ramot, B. E. Johnson, T. L. Berry, Jr., L. Carnell, and M. B. Goodman. The parallel worm tracker: A platform for measuring average speed and drug-induced paralysis in nematodes. PLoS ONE, 3(5):e2208, 2008. 3
[9] T. Roussillon and J.-O. Lachaud. Accurate curvature estimation along digital contours with maximal digital circular arcs. In 14th International Workshop on Combinatorial Image Analysis (IWCIA), LNCS, pages 43-55. Springer, 2011. 8
[10] G. D. Tsibidis and N. Tavernarakis. Nemo: a computational tool for analyzing nematode locomotion. BMC Neuroscience, 8(1):1-7, 2007. 3
[11] F. Ying. Visual ants tracking. PhD thesis, University of Bristol, 2004. 3

21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3-5, 2016

Significance of Colors in Texture Datasets
Milan Šulc, Jiří Matas
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Center for Machine Perception, Technická 2, 166 27 Praha 6, Czech Republic
{sulcmila,matas}@fel.cvut.cz

Abstract. This paper studies the significance of color in eight publicly available datasets commonly used for texture recognition through the classification results of "pure-color" and "pure-texture" (color-less) descriptors.
The datasets are described using the state-of-the-art color descriptors, Discriminative Color Descriptors (DD) [15] and Color Names (CN) [28]. The descriptors are based on partitioning of the color space into clusters and assigning the image probabilities of belonging to individual clusters. We propose a simple extension of the DD and the CN descriptors, adding the standard deviations of the color cluster probabilities to the descriptor. The extension leads to a significant improvement in recognition rates on all datasets. On all datasets the 22-dimensional improved CNσ descriptor outperforms all original 11-, 25- and 50-dimensional descriptors. Linear combination of the state-of-the-art "pure-texture" classifier with the CNσ classifier improves the results on all datasets.

1. Introduction

Visual recognition based on texture and color are well established computer vision disciplines with several surveys available, e.g. [3, 10, 19, 20, 27, 30]. The state-of-the-art in texture recognition has recently been dominated in terms of accuracy by methods based on deep Convolutional Neural Networks (CNNs) [5, 6], yet the pre-CNN approaches may be preferable in real-time applications for their performance without parallel processing. Although it has been shown that several texture description methods can benefit from adding color information [13], a large number of the pre-CNN texture recognition techniques has been evaluated only on gray-scale images. Since many publicly available datasets used for texture recognition contain color information, we decided to evaluate the accuracy of color-statistics based methods to measure the significance of color information in the datasets.

The first contribution of this paper is a study of the significance of color information in available datasets commonly used for evaluation of texture recognition methods. In total we evaluate 8 texture datasets, namely FMD (Flickr Material Database), ALOT (A Lot Of Textures), KTH-TIPS (Textures under varying Illumination, Pose and Scale), KTH-TIPS2a, KTH-TIPS2b, CUReT (Columbia-Utrecht Reflectance and Texture), VehApp (Vehicle Appearance) and AniTex (Animal Texture).

The second contribution of the paper is an improvement of the state-of-the-art color descriptors, Discriminative Color Descriptors (DD) [15] and Color Names (CN) [28]. DD and CN are based on partitioning of the color space into clusters and assigning each color the probabilities of belonging to individual clusters. Our extension to the DD and the CN descriptors adds the standard deviation for each color cluster to the descriptor. This leads to an improvement in recognition rates on all 8 tested datasets, as shown in the experiments in Section 5.

The third contribution of the paper are experiments combining a state-of-the-art "pure-texture" descriptor with the improved CNσ descriptor, leading to a further increase in recognition accuracy.

The rest of the paper is organized as follows: Sections 2.1 and 2.2 review the state of the art in texture and color recognition, respectively. The selected "pure-color" image descriptors and our extension to them are introduced in Section 3. Publicly available color-image databases commonly used for texture classification are described in Section 4. Section 5 describes the experiments and presents the results. The observations are discussed and conclusions are drawn in Section 6.

2. State of the Art

2.1. Texture-Based Classification
A large number of texture recognition techniques has been proposed, many of them being described in the surveys [3, 19, 20, 30]. In this section we only review the recent developments and the state-of-the-art.

Several recent texture recognition algorithms report excellent results on standard datasets while ignoring the available color information. A number of them is based on the popular Local Binary Patterns, such as the Pairwise Rotation Invariant Co-occurrence Local Binary Pattern of Qi et al. [22] or the Fast Features Invariant to Rotation and Scale of Texture of Sulc and Matas [26]. A cascade of invariants computed by scattering transforms was proposed by Sifre and Mallat [24] in order to construct an affine invariant texture representation. Mao et al. [18] use a bag-of-words model with a dictionary of so called active patches: raw intensity patches that undergo further spatial transformations and adjust themselves to best match the image regions. While the Active Patch Model does not use color information, the authors claim that adding color will further improve the results. Cimpoi et al. [4], using Improved Fisher Vectors (IFV) for texture description, show further improvement when combined with describable texture attributes learned on the Describable Textures Dataset (DTD) and with color attributes.

Recently, Cimpoi et al. [5, 6] pushed the state-of-the-art in texture recognition using a new encoder denoted as FV-CNN-VD, obtained by Fisher Vector pooling of a very deep Convolutional Neural Network (CNN) filter bank of Simonyan and Zisserman [25]. The CNN filter bank operates on (pre-processed) RGB images. The method achieves state-of-the-art accuracy, yet may not be suitable for real-time applications when evaluated without a high-performance GPU.

2.2. Color Statistics for Classification

Color information is processed by many state-of-the-art descriptors in Computer Vision, including the neurocodes of Deep CNNs or different extensions of SIFT incorporating color. Yet we are interested in simpler color statistics, not making use of spatial information.

Standard approaches to collect color information include color histograms (based on different color representations), color moments and moment invariants. Sande et al. [27] provide an extensive evaluation of such descriptors. The Color Names (CN) descriptor by Weijer et al. [28] is based on models learned from real-world data obtained from Google by searching for 11 color names in English. The Color Names have shown to be a successful color attribute for object detection [12] and recognition [14]. The model assigns each pixel the probability of belonging to one of the 11 color clusters. A similar approach is used by the Discriminative Color Descriptor (DD) of Khan et al. [15], where the color values are clustered together based on their discriminative power in a classification problem, with the objective to minimize the drop of mutual information of the final representation.

Khan et al. [13] study the strategies of combining color and texture information. They carried out a comparison of pure color descriptors on the publicly available KTH-TIPS2a, KTH-TIPS2b and FMD datasets, and on another small dataset denoted as Texture-10. Since the results of Color Names and Discriminative Color Descriptors outperformed other color descriptors in texture classification, we will describe the usage of CN and DD in more detail in Section 3 and use the models in our experiments in Section 5.

3. Selected Color Descriptors

Based on the findings of Khan et al. [13] and on our preliminary results, we consider the Color Names [28] and Discriminative Color Descriptors [15] the best match for our experiments for their superior classification accuracy. While each of the approaches creates the color models based on different criteria, the result is a soft assignment of clusters to each RGB value.
In both cases the assignment is performed using a lookup table, which creates a mapping from RGB values to probabilities over C clusters c_i, i.e. p(c_i | x). In this work we use the lookup tables provided by the authors of the methods, i.e. the 11-dimensional Color Names representation by [28] and the universal color 11-, 25- and 50-dimensional representations by [15].

The models assume a uniform prior over the color names p(c_i). The conditional probabilities for each cluster c_i given an image I are computed as an average over all N pixels x_n in the region:

p(c_i \mid I) = \frac{1}{N} \sum_{x_n \in I} p(c_i \mid x_n)    (1)

The standard descriptor D for image I is then a vector containing the probability of each cluster:

D(I) = \big( p(c_1 \mid I), \; p(c_2 \mid I), \; \ldots, \; p(c_C \mid I) \big)^{\top}    (2)

We propose to add another statistic to the color descriptor, the standard deviation of the color cluster probabilities in the image:

\sigma(c_i \mid I) = \sqrt{ \frac{1}{N} \sum_{x_n \in I} \big[ p(c_i \mid x_n) - p(c_i \mid I) \big]^2 }    (3)

We concatenate the standard deviations to the original descriptor to get the extended representation:

D_\sigma(I) = \big( p(c_1 \mid I), \; \ldots, \; p(c_C \mid I), \; \sigma(c_1 \mid I), \; \ldots, \; \sigma(c_C \mid I) \big)^{\top}    (4)
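The following sketch illustrates Equations (1)-(4) in Python. The layout of the lookup table below (one probability vector per RGB value) is an assumption for exposition; the tables released by the CN/DD authors quantize the RGB cube and use their own storage format.

import numpy as np

def color_descriptor(image_rgb, lookup_table, mask=None):
    # lookup_table: array indexed by (r, g, b), giving p(c_i | x) over C clusters
    r, g, b = image_rgb[..., 0], image_rgb[..., 1], image_rgb[..., 2]
    probs = lookup_table[r, g, b]            # per-pixel p(c_i | x_n), shape (H, W, C)
    if mask is not None:
        probs = probs[mask > 0]              # restrict to the region of interest
    else:
        probs = probs.reshape(-1, probs.shape[-1])
    mean = probs.mean(axis=0)                               # Eq. (1): p(c_i | I)
    std = np.sqrt(((probs - mean) ** 2).mean(axis=0))       # Eq. (3): sigma(c_i | I)
    return np.concatenate([mean, std])                      # Eq. (4): D_sigma(I), dimension 2C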
4. Color Texture Datasets

This section reviews publicly available texture datasets that contain color information. Databases available only in a gray-scale version, such as Brodatz, UIUCTex or UMD, are omitted.

4.1. CUReT

The Columbia-Utrecht Reflectance and Texture (CUReT) image database [8], commonly used for texture recognition1, contains 5612 images of 61 classes. There are 92 images per class, with different combinations of view- and illumination-direction. The standard experimental protocol divides the dataset into two halves, using 46 images per class for training and 46 images for testing. Examples of four selected classes from the dataset are displayed in Figure 1.

Figure 1: Examples of four texture classes from the CUReT database: (a) Felt, (b) Polyester, (c) Lettuce leaf, (d) Corn husk.

1 http://www.robots.ox.ac.uk/~vgg/research/texclass/setup.html

4.2. KTH-TIPS

The Textures under varying Illumination, Pose and Scale (KTH-TIPS) database [9, 11] was collected by Fritz, Hayman and Caputo with the aim to supplement the CUReT database, concerning texture variations in real-world conditions. The dataset contains 81 images for each of 10 selected materials, taken with different combinations of pose, illumination and scale. The dataset contains samples of different color for several materials; each of the samples appears several times. In the experimental protocol the dataset is randomly divided into halves: 40 images per class are used for training and the remaining 41 images are used for testing. It is thus probable that each of the samples appears in the training data set.

4.3. KTH-TIPS2

The KTH-TIPS2 database [2, 17], gathered by Mallikarjuna, Targhi, Hayman and Caputo, largely followed the procedure used for the previous KTH-TIPS database, with some differences in scale and illumination. The database also contains images from the previous KTH-TIPS dataset. The objective of the database is to provide a better means of evaluation: it contains 4 physical samples for each of 11 materials, and images of no physical sample are present in both the training and the test set. The database contains 108 images of each physical sample. There are two versions of the database: KTH-TIPS2a and KTH-TIPS2b. In the KTH-TIPS2a dataset, 144 images are missing (namely there are four samples with only 72 images). In the experimental protocol, three samples from each class form the training set and the remaining sample is used for testing. In the case of the KTH-TIPS2b dataset, one sample forms the training set and the remaining three form the test set. Examples from all four samples of four selected classes from the database are displayed in Figure 2.

Figure 2: Examples of four texture classes from the KTH-TIPS2 database: (a) Corduroy, (b) Lettuce, (c) Wood, (d) Wool. Each image belongs to a different physical sample.

4.4. ALOT

The Amsterdam Library of Textures (ALOT) [1] is similar in spirit to the CUReT dataset, yet the number of materials is much higher: it contains 250 texture classes, 100 images per class. The pictures were taken under various viewing and illumination directions and illumination colors. For evaluation, 20 images per class are used for training and the remaining 80 images per class are used for testing. Examples from the ALOT database are displayed in Figure 3.

Figure 3: Examples of four texture classes from the ALOT database: (a) Fruit sprinkles, (b) Pepper (red), (c) Color calibration checker, (d) Macaroni.

4.5. FMD

The Flickr Material database (FMD) was developed by Sharan et al. [23] with the intention of capturing a range of real world appearances of common materials. The dataset contains 1000 images downloaded manually from Flickr.com (under Creative Commons license), belonging to one of the following materials: Fabric, Foliage, Glass, Leather, Metal, Paper, Plastic, Stone, Water or Wood. There are exactly 100 images for each of the 10 material classes. Unlike the datasets described above, FMD was not primarily created for texture recognition, and it includes images of objects with various textures for each material. The dataset also includes binary masks for background segmentation. The standard evaluation protocol divides the images in each class into two halves, 50 images for training and 50 for testing. Examples from the FMD dataset are displayed in Figure 4.

Figure 4: Examples of four texture classes from the FMD database: (a) Fabric, (b) Foliage, (c) Glass, (d) Stone.

4.6. AniTex

The Animal Texture dataset (AniTex) constructed by Mao et al. [18] contains 3120 texture patch images cropped randomly from the torso regions inside the silhouettes of different animals in the PASCAL VOC 2012 database. There are only 5 classes (cat, dog, sheep, cow and horse), 624 images each. The authors created the dataset to explore less homogeneous texture and appearance than available in standard texture datasets.
The patches in the dataset come from images taken under different conditions such as scaling, rotation, viewing angle variations and lighting condition change. For evaluation, the dataset is randomly divided into 2496 training and 624 testing images. Examples from the AniTex dataset are displayed in Figure 5.

Figure 5: Examples of four texture classes from the AniTex database: (a) Cat, (b) Dog, (c) Sheep, (d) Cow.

4.7. VehApp

The Vehicle Appearance dataset (VehApp) was created by the same authors as AniTex [18] with the same intentions. It contains 13723 images cropped from PASCAL VOC images containing vehicles of 6 classes (aeroplane, bicycle, car, bus, motorbike, train). The images are evaluated in a way similar to AniTex: 80% of the images are randomly chosen for the training set, the remaining 20% are used for testing. Examples from the VehApp dataset are displayed in Figure 6.

Figure 6: Examples of four texture classes from the VehApp database: (a) Plane, (b) Bicycle, (c) Bus, (d) Car.

5. Experiments

We compute 8 descriptors for each image in every database: the standard 11-dimensional Color Name descriptor CN and our extended 22-dimensional version CNσ; the 11-, 25- and 50-dimensional Discriminative Color Descriptors DD11, DD25, DD50 and the extended versions DD11σ, DD25σ, DD50σ of double dimensionality.

The multiclass classification is then performed for each descriptor separately by combining binary SVM classifiers in a One-vs-All scheme. Linear SVM classifiers were used together with an approximate feature map of Vedaldi and Zisserman [29]. The χ2 kernel approximation and the histogram intersection kernel approximation were considered; the latter was chosen based on slightly superior performance in preliminary experiments. Platt's probabilistic output [16, 21] was used in order to estimate the posterior class probabilities to choose the result in the One-vs-All scenario. To minimize the effect of the random splits into training and test set, each experiment is performed 10 times on a different split, with the exception of the KTH-TIPS2 databases with 4 experiments based on the material samples.

All 8 color descriptors are compared in terms of class recognition accuracy in Table 1. The best published results of "pure-texture" (color-less) methods and the results of the state-of-the-art FV-CNN [5] method are attached to the table for comparison. The comparison of the best "pure-color" and "pure-texture" results on all 8 datasets is illustrated in Figure 7.
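A minimal scikit-learn sketch of such a One-vs-All linear SVM pipeline is given below. It is only an approximation of the setup described above: AdditiveChi2Sampler provides an explicit feature map for the additive χ2 kernel (a stand-in for the histogram intersection approximation of Vedaldi and Zisserman used in the paper), and sigmoid calibration plays the role of Platt's probabilistic output; the parameter values are placeholders.

import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

def build_color_classifier():
    # Linear SVMs on an explicit additive-kernel feature map,
    # calibrated to return posterior class probabilities
    svm = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
    return make_pipeline(AdditiveChi2Sampler(sample_steps=2),
                         OneVsRestClassifier(svm))

# X_train: rows are CN-sigma (or DD-sigma) descriptors, y_train: class labels
# clf = build_color_classifier().fit(X_train, y_train)
# probabilities = clf.predict_proba(X_test)   # posterior class probabilities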
Table 1: Recognition accuracy of selected color descriptors on publicly available databases commonly used for texture recognition.

              CUReT       TIPS        TIPS2a      TIPS2b      ALOT        FMD         AniTex      VehApp
# classes     61          10          11          11          250         10          5           6
CN            85.9±0.6    99.3±0.9    46.7±2.0    39.0±2.5    51.0±0.5    26.3±2.4    38.0±2.0    34.7±1.0
DD11          68.7±0.9    95.5±1.3    43.5±6.5    36.1±1.0    38.2±0.4    24.0±1.1    32.4±1.6    33.2±1.0
DD25          83.4±0.8    96.8±0.9    44.0±7.6    36.0±2.3    60.9±0.5    23.9±1.4    36.0±1.7    36.9±0.6
DD50          87.7±1.0    99.0±0.7    46.9±4.8    38.5±1.5    65.5±0.4    22.6±1.4    37.4±1.1    39.1±1.0
CNσ           94.2±0.6    99.8±0.3    51.7±5.7    42.6±1.4    73.9±0.5    28.0±2.2    41.7±1.8    39.1±0.7
DD11σ         81.9±0.8    97.6±1.0    48.5±3.8    38.3±1.9    60.1±0.5    22.7±1.6    35.9±2.1    35.8±0.5
DD25σ         88.9±0.7    99.4±0.3    49.1±3.7    39.9±4.5    75.0±0.5    23.9±1.1    39.9±1.6    39.3±0.7
DD50σ         91.0±0.7    99.6±0.2    53.2±4.6    42.0±2.8    78.0±0.5    25.3±1.7    38.9±0.8    41.2±0.9
FV-CNN [5]    99.0±0.2    –           –           81.8±2.5    98.5±0.1    79.8±1.8    –           –
Pure-texture  99.8±0.1[24] 99.7±0.1[4] 88.2±6.7[26] 76.0±2.9[26] 95.9±0.5[26] 57.4±1.7[22] 50.8[18]   63.4[18]

An experiment on combining efficient classifiers of "pure-texture" and "pure-color" was performed as follows: Each image was described using the CNσ color descriptor (using the same method as above) and the Ffirst [26] texture descriptor (with nconc = 3 descriptors per image, each describing c = 7 consecutive scales). An approximate intersection kernel map is applied to both color and texture descriptors, which are then classified using the One-vs-All Support Vector Machines with Platt's probabilistic outputs. The final scores in Table 2 were then combined using 3 axiomatic approaches, denoted as:

1. PROD: The dot product of both of the scores is used for the final decision.
2. SUM: The sum of both of the scores is used for the final decision.
3. SUM0.3: The weighted sum of both of the scores is used for the final decision, where the weight of color is only 30% of the weight of texture, taking into account the lower performance of the color descriptors on most datasets.

In terms of combining probability distributions [7], the SUM and SUM0.3 schemes represent a linear opinion pool and the PROD scheme represents a logarithmic opinion pool.
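As an illustration, the three combination schemes listed above can be sketched as follows; p_texture and p_color are assumed to be the per-class probability scores produced by the two calibrated classifiers for one image.

import numpy as np

def combine_scores(p_texture, p_color, scheme="SUM0.3", color_weight=0.3):
    p_texture = np.asarray(p_texture, dtype=float)
    p_color = np.asarray(p_color, dtype=float)
    if scheme == "PROD":      # logarithmic opinion pool: element-wise product
        scores = p_texture * p_color
    elif scheme == "SUM":     # linear opinion pool: element-wise sum
        scores = p_texture + p_color
    elif scheme == "SUM0.3":  # weighted linear pool, color weighted at 30%
        scores = p_texture + color_weight * p_color
    else:
        raise ValueError("unknown combination scheme: " + scheme)
    return int(np.argmax(scores))  # index of the predicted class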
Figure 7: Comparison of the best published results of "pure-texture" descriptors and the best results obtained using "pure-color" descriptors (recognition accuracy in % on all 8 datasets).

Table 2: Recognition accuracy for combinations of "pure-texture" (Ffirst) and "pure-color" (CNσ) descriptors.

              CUReT        TIPS         TIPS2a       TIPS2b       ALOT         FMD          AniTex       VehApp
# classes     61           10           11           11           250          10           5            6
CNσ           94.24±0.60   99.83±0.31   51.73±5.71   42.64±1.43   73.86±0.46   27.98±2.20   41.67±1.77   39.07±0.67
Ffirst        99.65±0.09   99.51±0.53   88.29±6.77   76.60±4.29   96.43±0.23   50.22±1.90   45.72±1.78   54.41±0.66
PROD          99.41±0.15   99.98±0.08   68.13±5.06   60.12±4.06   94.65±0.20   46.58±2.37   49.97±1.50   56.47±0.76
SUM           99.04±0.20   100.00±0.00  77.59±5.87   60.35±5.13   92.06±0.29   45.70±2.47   50.08±1.56   56.56±0.98
SUM0.3        99.68±0.12   99.85±0.26   88.76±6.40   77.17±4.23   97.05±0.14   52.24±1.68   48.99±1.83   56.62±0.92

6. Observations and Conclusions

A set of experiments with color-based image descriptors was performed on 8 datasets commonly used for texture classification, leading to interesting insights in color-based classification and in the understanding of available texture-recognition datasets.

One can see that using the simple color descriptors is sufficient for excellent results in specific cases, such as the KTH-TIPS dataset, where materials of the same color appear in both training and test data. Satisfying results can also be obtained on the CUReT and ALOT datasets. The KTH-TIPS2a and KTH-TIPS2b datasets are more difficult for "pure-color" classification, since testing data may come from samples of different colors than the training data, as illustrated in Figure 2. The FMD, AniTex and VehApp datasets are quite difficult for their heterogeneous nature, both in terms of texture and color. Yet the color statistics might still provide useful information when combined with other descriptors.

An extension to the Color Names (CN) and Discriminative Color Descriptors (DD) has been proposed (denoted as CNσ, DDσ), significantly improving the recognition accuracy on all 8 tested datasets. The comparison of the Color Names (CN) and Discriminative Color Descriptors (DD) brings a surprising observation: on 6 out of the 8 texture datasets, Color Names outperform even the higher-dimensional Discriminative Color Descriptors DD25, although the opposite may be expected from the findings on different tasks [15]. The improved CNσ outperforms other "pure-color" descriptors on 5 out of 8 datasets; the best results on the remaining 3 datasets are achieved by the improved DD50σ descriptor.

Combining a state-of-the-art "pure-texture" classifier [26] with the "pure-color" classifier of CNσ leads to an improvement on all 8 tested datasets. The weights of the classifiers in the combination should be set according to the classifiers' performance. Note that by combining the classifiers a 100% accuracy was achieved on KTH-TIPS. Significant improvements are also achieved on the AniTex and VehApp databases, where [26] performs rather poorly.

The state-of-the-art "pure-texture" and "pure-color" classifiers and their combinations obtain excellent results on simpler texture-recognition problems.

Acknowledgements

Milan Šulc was supported by CTU student grant SGS15/155/OHK3/2T/13, Jiří Matas by the Czech Science Foundation Project GACR P103/12/G084.

References
[1] G. J. Burghouts and J.-M. Geusebroek. Material-specific adaptation of color invariant features. Pattern Recognition Letters, 30(3):306-313, 2009. 4
[2] B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific material categorisation. In Proc. of IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1597-1604. IEEE, 2005. 3
[3] C.-h. Chen, L.-F. Pau, and P. S.-p. Wang. Handbook of pattern recognition and computer vision. World Scientific, 2010. 1, 2
[4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 3606-3613. IEEE, 2014. 2, 6
[5] M. Cimpoi, S. Maji, I. Kokkinos, and A. Vedaldi. Deep filter banks for texture recognition, description, and segmentation. arXiv preprint arXiv:1507.02620, 2015. 1, 2, 5, 6, 7
[6] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828-3836, 2015. 1, 2
[7] R. T. Clemen and R. L. Winkler. Combining probability distributions from experts in risk analysis. Risk analysis, 19(2):187-203, 1999. 6
[8] K. J. Dana, B. Van Ginneken, S. K. Nayar, and J. J. Koenderink.
Reflectance and texture of real-world surfaces. ACM Transactions on Graphics (TOG), lems. They are outperformed by the recent FV-CNN 18(1):1–34, 1999. 3 model [5] in the more difficult tasks. Yet the low [9] M. Fritz, E. Hayman, B. Caputo, and J.-O. Eklundh. computational complexity of some ”pure-texture” The kth-tips database, 2004. 3 and ”pure-color” descriptors is beneficial and their [10] T. Gevers, A. Gijsenij, J. Van de Weijer, and J.-M. performance may be still interesting for future works, Geusebroek. Color in computer vision: fundamen- e.g. when used in a cascade classification scheme tals and applications, volume 23. John Wiley & and followed by FV-CNN in case of ambiguity. Sons, 2012. 1 [11] E. Hayman, B. Caputo, M. Fritz, and J.-O. Ek- [25] K. Simonyan and A. Zisserman. Very deep convo- lundh. On the significance of real-world conditions lutional networks for large-scale image recognition. for material classification. In Computer Vision– arXiv preprint arXiv:1409.1556, 2014. 2 ECCV 2004, pages 253–266. Springer, 2004. 3 [26] M. Šulc and J. Matas. Fast features invariant to ro- [12] F. S. Khan, R. M. Anwer, J. van de Weijer, A. D. tation and scale of texture. In L. Agapito, M. M. Bagdanov, M. Vanrell, and A. M. Lopez. Color at- Bronstein, and C. Rother, editors, Computer Vision– tributes for object detection. In Computer Vision ECCV 2014 Workshops, Part II, volume 8926 of and Pattern Recognition (CVPR), 2012 IEEE Con- LNCS, pages 47–62, Gewerbestrasse 11, CH-6330 ference on, pages 3306–3313. IEEE, 2012. 2 Cham (ZG), Switzerland, September 2015. Springer [13] F. S. Khan, R. M. Anwer, J. van de Weijer, M. Fels- International Publishing AG. 2, 5, 6, 7 berg, and J. Laaksonen. Compact color–texture de- [27] K. E. Van De Sande, T. Gevers, and C. G. Snoek. scription for texture classification. Pattern Recogni- Evaluating color descriptors for object and scene tion Letters, 51:16–22, 2015. 1, 2 recognition. PAMI, 32(9):1582–1596, 2010. 1, 2 [14] F. S. Khan, J. Van de Weijer, and M. Vanrell. Mod- [28] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Lar- ulating shape features by color attention for object lus. Learning color names for real-world applica- recognition. International Journal of Computer Vi- tions. Image Processing, IEEE Transactions on, sion, 98(1):49–64, 2012. 2 18(7):1512–1523, 2009. 1, 2 [29] A. Vedaldi and A. Zisserman. Efficient additive ker- [15] R. Khan, J. Van de Weijer, F. Shahbaz Khan, nels via explicit feature maps. PAMI, 34(3), 2011. D. Muselet, C. Ducottet, and C. Barat. Discrimina- 5 tive color descriptors. In Computer Vision and Pat- tern Recognition (CVPR), 2013 IEEE Conference [30] J. Zhang and T. Tan. Brief review of invariant texture on, pages 2866–2873. IEEE, 2013. 1, 2, 7 analysis methods. Pattern recognition, 35(3):735– 747, 2002. 1, 2 [16] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on platts probabilistic outputs for support vector ma- chines. Machine learning, 68(3), 2007. 5 [17] P. Mallikarjuna, M. Fritz, A. Targhi, E. Hayman, B. Caputo, and J. Eklundh. The kth-tips and kth- tips2 databases. http://www.nada.kth.se/ cvap/databases/kth-tips, 2006. 3 [18] J. Mao, J. Zhu, and A. L. Yuille. An active patch model for real world texture and appearance clas- sification. In Computer Vision–ECCV 2014, pages 140–155. Springer, 2014. 2, 4, 5, 6 [19] M. Mirmehdi, X. Xie, and J. Suri. Handbook of tex- ture analysis. Imperial College Press, 2009. 1, 2 [20] M. Pietikäinen. Texture recognition. Computer Vi- sion: A Reference Guide, pages 789–793, 2014. 
1, 2 [21] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3), 1999. 5 [22] X. Qi, R. Xiao, C.-G. Li, Y. Qiao, J. Guo, and X. Tang. Pairwise rotation invariant co-occurrence local binary pattern. PAMI, 36(11):2199–2213, 2014. 2, 6 [23] L. Sharan, R. Rosenholtz, and E. Adelson. Mate- rial perception: What can you see in a brief glance? Journal of Vision, 9(8):784–784, 2009. 4 [24] L. Sifre and S. Mallat. Rotation, scaling and de- formation invariant scattering for texture discrimi- nation. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1233– 1240. IEEE, 2013. 2, 6 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A Novel Concept for Smart Camera Image Stitching Majid Banaeyan∗ , Hanna Huber∗ , Walter G. Kropatsch∗ and Raphael Barth+ Vienna University of Technology ∗Pattern Recognition and Image Processing group {majid,hanna,krw}@prip.tuwien.ac.at +Indiecam raphael@indiecam.com Abstract. As panoramic images are widely used further peripheral processing devices. in many applications, efficient image stitching meth- Common image stitching techniques take images ods that provide visually pleasant image mosaics are taken from different views and align them using needed. In this paper we discuss a novel concept for image registration in overlapping regions. So far, all smart camera image stitching based on graph pyra- images are collected and aligned centrally, which mids. For a multi-camera system, the images have suffers from high computational cost. Thus, we aim to be aligned accordingly to create an image mosaic. at parallelizing parts of this process by developping Instead of calculating the corresponding transforma- smart cameras that are able to perform some of the tions centrally, we aim at enabling each camera to in- image transformations themselves. dividually calculate the transformation of the image it takes. Graph pyramids used for image segmenta- The camera systems we consider use fish-eye tion provide information about the segmentation pro- lenses. General camera models such as the pinhole cess. We analyze how this information can be used to model cannot be applied to these lenses, because they calculate the transformations for image alignment. do not conform to the perspective projection due to their large field-of-view. Simple models are given for different projections of ideal fish-eye lenses. They 1. Introduction provide a formula for the radius r which is the dis- Panoramic views form the basis of many applica- tance between an image point and the principal point. tions including augmented reality applications. Pro- The principal point is the point where the optical axis ducing video content with high quality seamless and intersects the image plane. In case of the equidistant artefact-free 360◦ of coverage is challenging per se projection the radius is given by and even more challenging if all related processing, r = f θ, (1) especially seamless stitching, has to work automati- cally and in real-time for live productions. where f is the focal length and θ is the the incident A suitable approach has to solve a system conflict angle of the ray from an object point. However, this between omnidirectional simultaneous video capture formula does not reflect the behavior of real lenses. 
on one hand, which cannot be done from the nodal Instead, extended models are developped which take point due to mechanical collision problems, and into account the high level of distortion. Parameter parallax-free stitching of panoramas without any par- values are estimated using calibration, defining a allax, ghosting and distortion artefacts on the other final model for a particular camera [16]. In fish-eye hand. lenses, both radial as well as tangential distortion is Using high on-board computing power of smart present. While radial distortion reduces the spatial cameras and a dedicated communication network be- resolution towards the periphery of an image and tween cameras could be used to integrate the entire distorts rectilinear objects, geometric shifts are the image processing for automatic real-time stitching result of tangential distortion [15]. into the cameras themselves, avoiding the need for In this paper, we define our concept of smart cam- Additionally, there are some noticeable distortions era image stitching and present ideas how to realize caused by wide-angle optics such as distortion in the it. We will first give an overview of related work border regions of images in fish-eye lenses which in the fields of fish-eye lenses, image stitching and result in additional loss in image resolution. smart cameras in section 2. After declaring our func- tional goal in section 3, we discuss open problems Smart camera networks have a wide range of that we aim to solve in order to realize it and present applications in various areas including surveillance our ideas for possible solutions including novel ap- systems, security monitoring, traffic control and proaches in section 4. Finally, in section 5, we con-telemedicine [1]. For instance, Kawamura et. al [17] clude our paper. proposed a reliable surveillance system for railway stations. Their system tracks suspicious behavior 2. Related Work by applying multiple camera fields of view. Smart sensors communicate with each other over a wire- In this section we present a selection of state-of- less mesh network. Moreover, as an application for the-art techniques in fields that are related to smart the airport, Shirmohammadi et. al [31] introduced camera image stitching. a decentralized target tracking scheme. Smart cam- 2.1. Smart Cameras era nodes automatically identify neighboring sensors with overlapping fields and produce a communica- The name of smart camera goes back to the tion graph which reflects how the nodes will interact middle of 1970s [29] when Ron Schneidermann to fuse measurements in the network. applied it in developing systems for controlling the shutter. Then in 1981 the optical mouse was 2.2. Multi-View Setups and Image Stitching invented by Richard Lyon [24, 25] which was the first realized smart camera including an imaging Moreover, numerous publications deal with device and embedded processing unit as a compact panoramic images. For image registration, feature- system. ”Smart camera is a label which refers based methods which use distinct image points are to cameras that have the ability to not only take generally favored over area-based techniques which pictures but also more importantly make sense of compare images window by window [39]. Lowe et what is happening in the image.” [4, Chapter 2, al. [6, 21, 7] introduced scale-invariant feature points page 21] Smart cameras employ various concepts (SIFT) which have been widely used since. 
They of computer vision and machine vision which can use a 128-dimensional feature vector. Ke and Suk- extract useful information from images resulting thankar [18] adopted this approach, but reduced the in special decisions based on that information. dimension of the descriptor to 36. Alternatively, Bay Smart cameras can be classified into three main et al. [3] presented a faster method based on Haar categories including integrated, compact-system and wavelets using speeded up robust features (SURF). distributed smart cameras [4, chapter 2]. Integrated All these features work well with standard per- smart cameras can be further subdivided into three spective projection since they are invariant to affine types including single-chip [2, 11], embedded [20] transformations and provide a sufficient number of and stand-alone smart cameras. Distributed smart corresponding points to recover the parameters of the cameras involve some sort of networking and have homography. Multi-view images taken from cameras recently attracted significant interest in academic at different positions lead to parallax errors. These and industries fields [28]. Indeed, some problems errors cannot be fully eliminated. Still, these effects such as depth information in foreground detection can be reduced. Global image transformations that and occlusion are difficult to be solved by single are calculated by fitting a homography to matched smart cameras. In this case, using multiple cameras feature points cannot handle parallax well. Zhang with a powerful computing platform is an advantage. and Liu [38] address this problem by combining However, we encounter some physical limitations the transformation using a homography with local of the acquisition hardware. Although current content-preserving warping. The homography is no professional cameras capture images at a horizontal longer chosen as the best fir for all feature point pairs, resolution of about 4k to 5k [27], they are insuffi- but considers only neighboring feature points. Ad- cient for large scales and wide-angle viewpoints. ditionally, they use a tolerant fitting threshold. Per- azzi et al. [27] describe an algorithm for generating level distortion, however, fewer terms are needed videos from unstructured camera arrays. They apply with the division model [10]. Based on this approach, the basic concept of local warping to remove the par- Aleman-Flores et al. [12] formulate a one-parameter allax and define a new error measure with increased model. sensitivity to stitching artifacts. Their method tries to In order to determine lens parameters, various smooth out the blurring, ghosting and some other dis- calibration procedures [37, 19, 32, 36, 34] have tortions caused usually when videos which feed from been developped. In many cases, they extract fea- unstructured camera arrays are combined to create a tures such as lines or corners from the image of a single panoramic video. Deen et al. [9] create image calibration pattern for which the world coordinates mosaics for scientific purposes. Thus, they focus on are known [15]. A self-calibration method based correct rather than visually pleasant results. Parallax on circle-fitting which does not require information errors are reduced by performing pointing correction. about the objects’ world coordinates is presented by Existing tools for panoramic image stitching as Bräuer-Burchardt and Voss [5]. 
However, the distor- well as camera calibration include Hugin 1 and tion of an image needs to exactly fit the chosen distor- PTGui 2, which are both based on Panorama Tools 3 tion model. Aleman-Flores et al. [12] determine the by Dersch. distortion parameter automatically by introducing it into Hough space and detecting distorted lines. German et al.[14] investigate the application of different map projections to panoramic images in- cluding projections of fish-eye lens images. Multi- 3. Our Goal: a 360◦ Image Mosaic view setups are addressed by Sturm et al.[33] who develop a multi-view geometry model for central and We consider a multi-camera system of small high- non-central cameras based on structure-from-motion quality cameras, in order to create a 360◦ image mo- and by Luo et al.[23] who focus on saliency detection saic. The system consists of six fish-eye lens cam- in multi-camera setups. eras. At this point we use the indieGS2K model pro- duced by Indiecam4. Two adjoining cameras share 2.3. Fish-Eye Lenses an overlapping region, respectively. Position and op- tical parameters can be chosen arbitrarily, but will Schwalbe [30] develops a geometric model for be fixed for a specific system. Each camera creates fish-eye lens cameras based on the approximately an image using fish-eye projection. Additionally, it linear relation between the incident angle of the holds the information about the other cameras’ set- ray from an object point and the distance from the tings. In the end, an image mosaic using equirect- corresponding image point to the principal point. angular projection is created. This means that the Distortion is accounted for by using conventional horizon is a straight line in the middle of the image distortion polynomials. Alternatively, Kannala and and vertical lines in real world are vertical lines in Brand [16] present a flexible camera model which the image [14]. Before the actual stitching can be is applicable for fish-eye as well as narrow-angle performed, the respective images have to be trans- lenses. They use a polynomial imaging function as formed accordingly. well as two additional terms for radial and tangential distortion, respectively. The final camera model in- Eventually, our goal is to develop the respective cludes 23 parameters. It provides both a forward as coordinate transformation model. For any two im- well as a backward model. Moreover, Luhmann et ages Ij and Ij+1 from the six cameras with overlap- al. [22] deal with the correction of chromatic aberra- ping view, a function Fj : Ij → Ij+1 has to be found tion in fish-eye images. Standard distortion correc- such that F (pj) = pj+1 for all corresponding pix- tion methods use odd polynomial models as used by els (pj, pj+1) in the overlapping region with pj ∈ Ij, Mallon and Whelan [26]. These models describe the pj+1 ∈ Ij+1. The resulting algorithm should take distorted radius r the image of one camera as well as the settings (posi- d as a polynomial function of the undistorted radius r tion, optics) of the other as input.The output will be u, using only odd terms. For high the accordingly transformed image. 1http://hugin.sourceforge.net/ 2https://www.ptgui.com/ 3http://panotools.sourceforge.net/ 4www.indiecam.com 4. A Novel Concept for Image Alignment consistent image segmentation (SCIS) [8], the infor- mation about the segmentation process is stored. 
As At this point, the following problems have to be this process is performed based on the structure of solved in order to determine the image transforma- the underlying image, it also contains information tion: about the distortion. A target coordinate system is 1. Calibrate the fish-eye lens and determine the defined by the continuous curves of a checkerboard distortion. pattern which follow the isolines of the coordinate system. By applying the segmentation to this pattern 2. Calculate the transformation from the fish-eye and storing the details of the segmentation process, projection to the equirectangular projection. the distortion information of the coordinate system is retrieved. 3. Perform a geometrical classification of possible setups. Considering two cameras C1 and C2, calculate critical points and distances in order to distinguish between the following classes: 4.1.2 Features of the SCIS Algorithm • region in which points can only be seen by The SCIS algorithm segments an image based on Lo- C1 cal Binary Patterns and the Combinatorial Pyramid. • region in which points can only be seen by It works on the local structure of the image and pre- C2 serves structural correctness [8, Chapter 4, page 39] • closer part of the overlapping region with and topology of an image. For this purpose, five visible parallax errors topological classes based on Local Binary Patterns of regions are applied which by combination with • part of the overlapping region with negli- the dual graph are able to remove redundant struc- gible parallax errors tural information. As a result, by using this approach 4. Calculate the coordinate transformation the image graph will be simplified and connected re- gions will be merged without introducing structural In order to solve these problems described in errors [8]. the previous section, we consider the following ap- The SCIS algorithm performs image segmentation proaches. using a graph-based image representation. It pro- 4.1. Lens Calibration and Image Alignment using vides the image at any level of segmentation as well Graph Pyramids as the information about the segmentation process up to that level. The latter contains information about While traditional lens and distortion models have the distortion structure. been studied extensively, we follow a different ap- Initially, each pixel corresponds to a vertex and proach. Our goal is to extract the distortion informa- each edge to a neighborhood relation in the graph, tion using graph pyramids. which represents the base level of the combinatorial pyramid. Subsequently, pixels are merged to regions 4.1.1 Overview which are in turn merged to larger regions based on Traditionally, lens calibration is based on a geometric their intensity values. On higher levels, each ver- model depending on parameters. The respective pa- tex corresponds to an image region. Merging cor- rameter values are determined during the calibration responds to edge contraction and removal. The SCIS procedure. This is a characteristic that previous lens algorithm creates the entire pyramid as well as the calibration methods have in common, even though contraction history. The latter is represented by the different models and procedures have been devel- contraction kernels. Thus, it is able to reconstruct opped. By establishing a model for which the pa- the segmented image at any level. An example of a rameters are specified, these methods already make combinatorial pyramid is shown in Figure 1. 
fundamental assumptions about the structure of the An evaluation study of stereo matching by Joan- distortion. On the contrary, we propose a calibration neum Research [13] shows that the SCIS algorithm method that determines the distortion including its achieves the highest matching quality compared to structure. In a graph pyramid as used for structurally different compression methods. Figure 3. Multi-camera calibration setup for six cameras C1 - C6. Figure 1. Example of a Combinatorial Pyramid. Image taken from [8] mogeneous regions. Since they have all the same value it cannot be said which edges are contracted or which are removed. For making the process more precise we can consider two solutions. One is to apply geometry of target coordinates and perform linear interpolation. How- ever, this approach has the drawback that we do not know the size of the distorted patch, which is partic- ularly problematic in our case where we expect se- vere deformation. The second approach is to shift the checkerboard pattern and create a new image from a different viewpoint. By iteratively applying this pro- cess, the regions inside the patches will be refined. For instance, we can take M captures with different Figure 2. Distorted checkerboard pattern with correspond- offsets. Next, the idea is to freeze only the bound- ing primal graph at the top of the pyramid. Each vertex (yellow) corresponds to a patch. Vertices of the adjacent aries of which we are sure that they are precisely de- patches are connected by an edge (red). lineated. Indeed, by taking two different positions (randomly) and overlapping with the two contraction kernels, both boundaries should be preserved. There- 4.1.3 Calibration Procedure fore, the random space of patches will be smaller and The canonical representation of the combinatorial smaller as the process is used more and more. pyramid stores it as a single array. The elements in There are two ways for applying this strategy. On this array are half-edges, called darts. They are or- the one hand, it can be performed sequentially by dered according to the contraction history. In order freezing the contraction kernels corresponding to the to extract information about the distortion from the boundaries from the previous iteration. On the other combinatorial pyramid, we consider the image of a hand, it can be performed randomly. Given the con- checkerboard pattern, where each patch is assigned traction kernels at every point and knowing the po- an absolute coordinate. At the top level of the pyra- sition of a boundary, we can integrate the contrac- mid, each vertex corresponds to a single patch (see tion kernels using high weights at boundaries and low Figure 2). weights in between. For homogeneous regions, the As a result we get the contraction history. The contraction kernels provided by the shifting approach top level delivers a single vertex for every patch of will converge towards the proper kernel. the checkerboard with its adjacency. All contracted With the contraction kernels provided, the infor- edges of a patch form a spanning tree of the corre- mation about the distortion is stored implicitly, al- sponding region in the primal graph. We do not know lowing us to apply it to any new image. Conve- anything about the contraction kernels inside the ho- niently, the canonical representation stores this in- Figure 5. Calibration pattern using a spherical target coor- dinate system with radius r, azimuthal angle θ and eleva- Figure 4. 
Calibration pattern using a cylindrical target tion angle φ. coordinate system with radius r, azimuthal angle θ and height h. 4.2. Projection remapping formation in an ordered array. Thus, the calibrated The remapping from fish-eye to equirectangular kernels which have to be applied to get to a particu- projection can also be handled by the graph-based lar level of the pyramid can be re-used. calibration method presented in the previous section. The calibration setup for a multi-camera system is For comparison, it can be addressed individually fol- illustrated in Figure 3. lowing German et al. [14]. Information about the camera’s roll, which is the rotation angle about the optical axis, and pitch, which is the elevation angle 4.1.4 Advantages of Calibration using Graph from the horizontal axis, allows the remapping from Pyramids a fish-eye to an equirectangular projection. Roll and pitch can be determined manually or by using hori- Apart from the fact that the graph-based approach zontal or vertical control lines. does not make any assumptions about the structure of the distortion, it yields other advantages compared to previous calibration methods. Accuracy can be in- 4.3. Setup Classification creased simply and reached to the resolution of orig- inal images by increasing the number of shifts. Ad- The classification of the setup with regard to par- ditionally, we do not need a global model of the geo- allax errors can be performed using partial edge con- metric projection for calibration, which is needed for tours as used by Wang et al. [35]. The edge contour many estimation methods of the parameters. Finally, of an obstacle is mapped from one image to the other. our method does not depend on a particular coordi- The parallax is then calculated as the transverse dis- nate system. Instead, any target coordinate system tance between corresponding edge contour pixels. can be chosen. It is defined by the checkerboard pat- tern where the continuous curves correspond to the 4.4. Image Transformation the isolines of a target coordinate system. Thus, var- ious geometries can be used for this approach such Similar to the projection remapping, the image as cylindrical (see Figure 4) or spherical (see Fig- tranformation used for image alignment can be de- ure 5) coordinate systems. In particular, the coordi- termined by the graph-based approach. For compar- nate system of the final mosaic can be chosen as tar- ison, the calibration of the multi-camera system can get coordinate system. In this case, the transforma- be performed using feature extraction and matching. tion provided by the calibration method does not only For this purpose, SIFT features [21] will be used. In consider lens distortion, but also includes remapping order to reduce parallax errors, the image transforma- to equirectangular projection as well as image align- tion will be calculated following the parallax-tolerant ment, and this simultaneously for all six cameras. approach used by Zhang and Liu [38]. 5. Conclusion [11] B. Flinchbaugh. Smart cameras systems technol- ogy roadmap. In B. Kisacanin, V. Pavlovic, and We presented a novel concept for the smart cam- T. Huang, editors, Real-Time Vision for Human- era image stitching. It aims at reducing the cost of the Computer Interaction, pages 285–297, 2005. 2 stitching process by enabling each camera of a multi- [12] M. A. Flores, L. Á. León, L. G. Déniz, and D. E. S. camera system to align the image that takes individ- Cedrés. Automatic Lens Distortion Correction Us- ually. 
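As a rough illustration of such a remapping, the sketch below converts a fish-eye image to an equirectangular panorama while compensating roll and pitch. It assumes an equidistant fish-eye model (r = f·θ) and a particular axis convention; it is not the formulation of German et al. [14], and the function name and parameters (`f`, `roll_deg`, `pitch_deg`, output size) are illustrative.

```python
import cv2
import numpy as np

def fisheye_to_equirect(img, f, roll_deg=0.0, pitch_deg=0.0, out_w=1024, out_h=512):
    """Sketch: remap an equidistant fish-eye image (r = f * theta) to an
    equirectangular panorama, compensating camera roll and pitch by rotating
    the viewing rays before projecting them back into the fish-eye image."""
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0

    # longitude/latitude of every output pixel
    lon = (np.arange(out_w) / out_w - 0.5) * 2.0 * np.pi          # [-pi, pi)
    lat = (0.5 - np.arange(out_h) / out_h) * np.pi                # [pi/2, -pi/2)
    lon, lat = np.meshgrid(lon, lat)

    # unit viewing rays in the panorama frame (z forward, x right, y down)
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    rays = np.stack([x, y, z], axis=-1)

    # undo camera roll (about the optical axis) and pitch (about the x axis)
    roll, pitch = np.deg2rad(roll_deg), np.deg2rad(pitch_deg)
    Rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll),  np.cos(roll), 0],
                   [0, 0, 1]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    rays = rays @ (Rz @ Rx).T

    # equidistant fish-eye projection: angle from the optical axis -> radius
    theta = np.arccos(np.clip(rays[..., 2], -1.0, 1.0))
    phi = np.arctan2(rays[..., 1], rays[..., 0])
    r = f * theta
    map_x = (cx + r * np.cos(phi)).astype(np.float32)
    map_y = (cy + r * np.sin(phi)).astype(np.float32)

    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)
```

Rays that fall outside the lens field of view simply map outside the source image and are filled with the border value.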
Lens calibration can be performed using graph ing One-Parameter Division Models. IPOL Image pyramids, which yields several advantages compared Processing OnLine (Special Issue on Lens Distor- to traditional lens calibration methods. Additionally, tion Models), 4:327–343, 2014. 3 the same method can be used to directly determine [13] B. Froehlich and M. P. Caballo-Perucha. Evaluation the image transformation required for image align- of image compression algorithms version 1.0, issue ment. Currently, the work is in progress, but in near D1, 2015. Joanneum Research. 4 future we are planning to experimentally prove the [14] D. M. German, P. d’Angelo, M. Gross, and B. Pos- applicability of the proposed ideas. tle. New Methods to Project Panoramas for Practi- cal and Aesthetic Purposes. In D. W. Cunningham, References G. Meyer, and L. Neumann, editors, Computational Aesthetics in Graphics, Visualization, and Imaging. [1] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury. The Eurographics Association, 2007. 3, 6 A survey on wireless multimedia sensor networks. [15] C. Hughes, M. Glavin, E. Jones, and P. Denny. Re- Computer Networks, 51:921–960, 2007. 2 view of geometric distortion compensation in fish- [2] L. Albani, P. Chiesa, D. Covi, G. Pedegani, A. Sar- eye cameras. In IET Irish Signals and Systems Con- tori, and M.Vatteroni. VISoc: A smart camera SoC. ference, 208. (ISSC 2008), 2008. 1, 3 In Proceedings of the 28th European Solid-State Cir- cuits Conference, pages 367–370, 2002. 2 [16] J. Kannala and S. S. Brandt. A generic cam- [3] H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded era model and calibration method for conventional, up robust features. In In ECCV, pages 404–417, wide-angle, and fish-eye lenses. IEEE TRANS. 2006. 2 PATTERN ANALYSIS AND MACHINE INTELLI- GENCE, 28:1335–1340, 2006. 1, 3 [4] A. N. Belbachir, editor. Smart Cameras. Springer, 2010. 2 [17] A. Kawamura, Y. Yoshimitsu, K. Kajitani, T. Naito, [5] C. Bräuer-Burchardt and K. Voss. A new algo- K. Fujimura, and S. Kamijo. Smart camera network rithm to correct fish-eye- and strong wide-angle- system for use in railway stations. In SMC, pages lens-distortion from single images. Proceedings 85–90. IEEE, 2011. 2 2001 International Conference on Image Processing [18] Y. Ke and R. Sukthankar. PCA-SIFT: A more dis- 2001, Vol.1, pp.225-228, 2001. 3 tinctive representation for local image descriptors. [6] M. Brown and D. G. Lowe. Recognising panoramas. In Proceedings of the 2004 IEEE Computer Society In Proceedings of the Ninth IEEE International Con- Conference on Computer Vision and Pattern Recog- ference on Computer Vision - Volume 2, ICCV ’03, nition, CVPR’04, pages 506–513, Washington, DC, pages 1218–1225, Washington, DC, USA, 2003. USA, 2004. IEEE Computer Society. 2 IEEE Computer Society. 2 [19] M. Kedzierski and A. Fryskowska. Precise method [7] M. Brown and D. G. Lowe. Automatic panoramic of fisheye lens calibration. In Proceedings of image stitching using invariant features. Int. J. Com- the ISPRS-Congress, pages 765–768. International put. Vision, 74(1):59–73, Aug. 2007. 2 Society for Photogrammetry and Remote Sensing, [8] M. Cerman. Structurally correct image segmenta- 2008. 3 tion using local binary patterns and the combinato- [20] B. Kisacanin, S. Bhattacharyya, and S. Chai. Em- rial pyramid, 2015. Wien, Techn. Univ., Dipl.-Arb., bedded Computer Vision. Springer, 2007. 2 2015, Technical Report 133. 4 [21] D. G. Lowe. Distinctive image features from [9] B. Deen. In-Situ Mosaic Production at JPL/MIPL. 
scale-invariant keypoints. Int. J. Comput. Vision, Pasadena, CA : Jet Propulsion Laboratory, National 60(2):91–110, Nov. 2004. 2, 6 Aeronautics and Space Administration, 2012. Plan- etary Data: A Workshop for Users and Software De- [22] T. Luhmann, H. Hastedt, and W. Tecklenburg. Mod- velopers 2012, JPL TRS 1992+. 3 elling of chromatic aberration for high precision [10] A. Fitzgibbon. Simultaneous linear estimation of photogrammetry. Remote Sensing and Spatial In- multiple view geometry and lens distortion. Pro- formation Sciences, 36 (Part 5):173–178. 3 ceedings of the 2001 IEEE Computer Society Con- [23] Y. Luo, M. Jiang, Y. Wong, and Q. Zhao. Multi- ference on Computer Vision and Pattern Recognition camera saliency. IEEE Trans. Pattern Anal. Mach. 2001, Vol.1, pp.I-I. 3 Intell., 37(10):2057–2070, 2015. 3 [24] R. Lyon. The optical mouse, and architectural and F. Zhang. A practical distortion correcting methodology for smart digital sensors. In H.T.Kung, method from fisheye image to perspective projection B.Sproull, and G.Steele, editors, Computer Science image. In Information and Automation, 2015 IEEE Press. Invited Paper, CMU Conference on VLSI International Conference on, pages 1178 – 1183, structures and Computations, 1981. 2 2015. 3 [25] R. Lyon. Apparatus for controlling movement of a [37] X. Ying, Z. Hu, and H. Zha. Fisheye lenses cali- curser in computer display system, 1983. European bration using straight-line spherical perspective pro- Patent. 2 jection constraint. In P. J. Narayanan, S. K. Nayar, [26] J. Mallon and P. Whelan. Precise radial un-distortion and H.-Y. Shum, editors, ACCV (2), volume 3852 of images. Proceedings of the 17th International of Lecture Notes in Computer Science, pages 61–70. Conference on Pattern Recognition, 2004, Vol.1, Springer, 2006. 3 pp.18-21. 3 [38] F. Zhang and F. Liu. Parallax-tolerant image stitch- [27] F. Perazzi, A. Sorkine-Hornung, H. Zimmer, ing. In Proceedings of the 2014 IEEE Confer- P. Kaufmann, O. Wang, S. Watson, and M. Gross. ence on Computer Vision and Pattern Recognition, Panoramic video from unstructured camera arrays. CVPR ’14, pages 3262–3269, Washington, DC, In Proc. Eurographics 2015, volume 34, 2015. 2, 3 USA, 2014. IEEE Computer Society. 2, 6 [28] B. Rinner and W. Wolf. An introduction to dis- [39] B. Zitov and J. Flusser. Image registration meth- tributed smart cameras. In Proceedings of the IEEE, ods: a survey. Image and Vision Computing, 2003, volume 96, pages 1565–1575, 2008. 2 Vol.21(11), pp.977-1000. 2 [29] R. Schneidermann. Smart cameras clicking with electronic functions. Electronics, 48:74–81, 1975. 2 [30] E. Schwalbe. Geometric modelling and calibra- tion of fisheye lens camera systems. In Proceed- ings 2nd Panoramic Photogrammetry Workshop, Int. Archives of Photogrammetry and Remote Sensing, pages 5–8, 2005. 3 [31] B. Shirmohammadi and C. J. Taylor. Distributed tar- get tracking using self localizing smart camera net- works. In Proceedings of the Fourth ACM/IEEE In- ternational Conference on Distributed Smart Cam- eras, pages 17–24, New York, NY, USA, 2010. 2 [32] P. Srestasathiern and N. Soontranon. A novel cam- era calibration method for fish-eye lenses using line features. ISPRS - International Archives of the Pho- togrammetry, Remote Sensing and Spatial Informa- tion Sciences, pages 327–332, Aug. 2014. 3 [33] P. Sturm, S. Ramalingam, and S. Lodha. On cal- ibration, structure-from-motion and multi-view ge- ometry for general camera models. In R. Reulke and U. 
Knauer, editors, 2nd ISPRS Panoramic Pho- togrammetry Workshop,, Berlin, Allemagne, 2005. ISPRS. Published in the Int. Archives of Photogram- metry, Remote Sensing and Spatial Information Sci- ences, Vol. XXXVI-5/W8. 3 [34] S. Urban, J. Leitloff, and S. Hinz. Improved wide- angle, fisheye and omnidirectional camera calibra- tion. {ISPRS} Journal of Photogrammetry and Re- mote Sensing, 108:72 – 79, 2015. 3 [35] X.-H. Wang, W.-P. Fu, and W. Chen. Detection of obstacle based on nocular vision. 2010 International Conference on Intelligent Computation Technology and Automation, May 2010, Vol.2, pp.71-74. 6 [36] Z. Wang, H. Liang, X. W. andYipeng Zhao, B. Cai, C. Tao, Z. Zhang, Y. Wang, S. Li, F. Huang, S. Fu, 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A concept for shape representation with linked local coordinate systems Manuela Kaindl and Walter G. Kropatsch Pattern Recognition and Image Processing Group, Vienna University of Technology, Austria http://www.prip.tuwien.ac.at Abstract. object’s element can be described as a transformation This paper discusses a concept for the repre- between two linked coordinate systems. Swinging of sentation of n-dimensional shapes by means of a the arm can be characterised as a transformation of model, based on linked local coordinate systems. the arms coordinate system in respect to the linked Through application of the medial axis transform coordinate system of the torso for movement of the (MAT) and decomposition of the resulting medial shoulder and transformation of the distal part’s coor- axis (MA), articulated, as well as non-rigid abstract dinate system in respect to the system of the upper n-dimensional bodies can be described by defining arm for movement of the elbow. The coordinate sys- corresponding local coordinate systems for each ele- tem of the hand in respect to the system of the fore- ment. This should allow a distinct and invariant rep- arm does not change in that case (Fig. 1). In case of a resentation of every point of the shape, which can be smooth deformation, local interpolation between the used for complex composite transformations of the transition of the elements may be needed. object in the context of robotic manipulation. 1. Introduction For the automatic manipulation of objects and rea- soning considering their attributes, a powerful model is needed. Articulated objects, like the human body, or deformable objects, like a piece of clothing, de- mand a model that is able to represent complex in- trinsic transformations. These classes of objects can be represented by defining coordinate systems for each segment, so every point of the object is dis- tinctly determined by a set of coordinates. One appli- cation, for both classes of objects mentioned, is auto- mated dressing-assistance for a person. Linked local coordinate systems should allow the description of every point of the shape, so it can be exactly defined where a robotic arm needs to grasp a glove and how it needs to place it for the person to slip in comfortably, considering the person’s range of motion. A coordinate system is specified by its origin, de- termining the location, and a set of basis vectors, Figure 1. Linked local coordinate systems of a swinging defining the orientation and scale of the element. arm. Frames indicating the area of a coordinate system. It makes the description of an element invariant to Forearm and hand do not move in respect to each other changes. 
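The idea of describing a point once in its element's frame and obtaining its global position by composing the linked transforms can be sketched in a few lines. The 2D chain below (torso, upper arm, forearm, hand) is a hypothetical example with made-up link lengths and joint angles; it is not taken from the paper.

```python
import numpy as np

def pose(angle, tx, ty):
    """2D rigid transform of a child frame expressed in its parent frame."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1.0]])

# kinematic chain: torso -> upper arm -> forearm -> hand
# link lengths and angles are arbitrary illustrative values
chain = {
    "upper_arm": ("torso",     pose(np.deg2rad(-30), 0.0, 0.40)),   # shoulder
    "forearm":   ("upper_arm", pose(np.deg2rad(45),  0.0, -0.30)),  # elbow
    "hand":      ("forearm",   pose(0.0,             0.0, -0.25)),  # wrist (fixed)
}

def to_world(frame, chain):
    """Compose the local transforms up the chain to map frame coordinates to the torso frame."""
    T = np.eye(3)
    while frame != "torso":
        parent, local = chain[frame]
        T = local @ T
        frame = parent
    return T

# a fingertip described once, in hand-local coordinates
fingertip_local = np.array([0.0, -0.10, 1.0])
print(to_world("hand", chain) @ fingertip_local)

# swinging the arm: only the shoulder transform changes, while the fingertip's
# hand-local coordinates stay the same
chain["upper_arm"] = ("torso", pose(np.deg2rad(20), 0.0, 0.40))
print(to_world("hand", chain) @ fingertip_local)
```

Swinging the arm changes only the shoulder transform; the fingertip keeps its hand-local coordinates, exactly the invariance the linked systems are meant to provide.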
In the case of articulated movements, the while the linked system of the distal part (parent of hand specific coordinates of the parts do not need to be and forearm) of the arm changes in respect to the system changed. The intrinsic movement of an articulated of the upper arm. The intrinsic movement of a non-rigid object is Handling and predicting articulated objects or supported by the model’s invariance to deformation non-rigid objects demands a complex model that can originating from the axial representation. The ob- represent the vast amount of different possible ap- ject’s axial representation provides the linked local pearances of an object. Several projects have already coordinate systems. In 3D space, axial representa- been dedicated to that issue. Li, Chen and Allen [11] tions can be produced by sweeping spheres along the used meshes of deformable objects to simulate the axis [16]. For 2D objects, geometric primitives, like movement and its results to identify grasping points circles or line segments, can be used as generators of garments. With a system of dictionary learning [14, 20]. The linked local coordinate systems are via spatial pyramid matching and sparse coding, a based on the resulting medial axis of the object using robotic grasper is enabled to grasp, flatten and fold an end point as the origin and a branch of the medial garments. Felzenszwalb, Mc Allester and Ramanan axis as a basis vector of the coordinate system. [5] published an algorithm for the recognition of de- Several problems need to be addressed to provide formable objects in images by means of a discrimina- a stable and invariant model that can represent an ob- tively trained, multiscale, deformable part model in ject and leads to reliable reasoning: 2008. Godec, Roth and Bischof [7] described hough- based tracking of non-rigid objects in 2013. Their 1. Noise approach utilises the generalised Hough-transform to handle articulated and non-rigid objects. Pouch et al. • Noise inside the shape creating holes. [13] resort to the MAT to segment the deformable • Noise along the boundary creating spuri- aortic valve apparatus in 3D echocardiographic im- ous branches. ages. To provide a stable basis for the concept, a MAT 2. Decomposition algorithm must be used which can provide a geomet- • Multiple affiliation of points in branching rically accurate and compact MA. In recent years, areas. several groups have been dedicated to improve prior efforts in that field. Li et al. published an approach 3. Preservation of structure for MAT by Quadratic Error Minimization to com- pute a stable and compact MA [10] The groups of • Ordering of axes at branching points. Zhu et al. published a paper on the constructive gen- eration of the medial axis for solid models [18] and 4. Special shapes also an approach for calculation of the medial axis • Spheres and objects based on spheres. of a CAD model by parallel computation [19]. Aich- holzer, Aigner, Aurenhammer and Juettler showed a • Circular MAs. technique for the MAT by means of a polyhedral unit The novelty of the method is the utilisation of ball instead of the standard Euclidean unit ball [2] linked local coordinate systems for the representation of n-dimensional objects for robotic manipulation. 3. Method MAT has the property of producing a MA of one The paper is organised as follows. In section 2, dimension less than the object in many cases. A related work is outlined. 
Section 3 describes the pro- 3D object creates a 2D MA and a MAT of a 2D posed method and its open problems in detail. Sec- MA generates a 1D manifold (Fig. 2.a) that can tion 4 concludes the paper with a discussion of the be decomposed at its branching points (Fig. 2.b, method. Fig. 2.c). As branching points we denote loca- tions where more than 2 branches of the MA meet. 2. Related Work These points represent the basis of convexities of Most recently, research in the field of robotic the shape. Points within the largest inscribing circle dressing-assistance was done by Gao, Chang and around these branching points, the branching area, Demiris, who utilise randomized forests for a model have an unclear affiliation to a MA branch, which of the upper body [6]. Klee et al. used a skeleton poses a problem when the connected MA branches tracker for a robotic dressing-application. move in respect to each other. The decomposed a b a b c Figure. 3 a) Elongated shape with MA and its Figure. 2. a) MAT of the image of a hand. Largest straightened representation (b). The representation is inscribing circles form the MA. b) Decomposed branch invariant to deformation. of the MA. c) Area to be described in respect to this branch. of MAT, decomposition and straightening creates a branch of the MA is straightened to form the x-axis graph with end points and branching points as nodes of a new coordinate system by replacing the geodesic and axis branches as edges (Fig. 5). distances by Euclidean coordinates. The distances along the MA stay identical, while the curvature is a b removed (Fig. 3). Figure 4. Coordinate system based on a MA branch. A point is defined by longitude and latitude. This makes the representation invariant to defor- Figure. 5 a) MAT of the image of a hand. b) Graph mation of the object, except stretching and compres- created by straightening the MA branches of the hand. sion, where the geodesic distance may change with movement. One end point of the axis can be cho- By means of the graph, the structure of the ob- sen as the origin. All points within the silhouette of ject can be identified. The graph concept is based on the object can be described as a tuple of longitude the notion of cellular complexes, described by Ko- along the axis and latitude as the distance of the point valevsky [9], which states that an n-dimensional ob- along the normal to the axis (Fig. 4). This procedure ject is confined by an (n-1)-dimensional object. The 1D MA is confined by 0D points, the 2D MA is con- ordinates of the MA are replaced by the geodesic co- fined by 1D curves and so forth. Based on this prin- ordinates the axis shall have within the shape. Fig. 8 ciple of cellular complexes and the attribute of MAT shows the 2D object that emerges from the composed to produce a MA of the objects dimensionality mi- MA. This 2D object itself can be used as the MA for nus 1 in many cases, it is assumed that the proposed a 3D object. This concept can be be continued due method holds for many n-dimensional objects by re- to the MAT’s attribute to create an object with the di- cursive application until 1-dimensionality is reached. mensionality of the object minus 1. So its reversal To communicate the principle of MA, we show leads to an object with the dimensionality of the MA how to build an abstract object from its MA. A shape plus 1. can be created by sweeping a circle along a 1D Axis as can be seen in Fig. 6. The MA is synonymous Figure 8. Covered area as 2D MA of a 3D object. Figure 6. 
Circles swept along a 1D MA. Transparency in- dicates the sweeping movement. 3.1. Noise Noise on the boundary of the shape can cause spu- with the x axis of a coordinate system we use to de- rious branches. Noise within the shape may cause fine all points of the shape. The radius of every circle holes, which can lead to circular MAs. Several at position x along the axis has to be stored to cre- projects are dedicated to the reduction of the influ- ate the intended object. This assures the preservation ence of noise on the MAT. Most recently Spitzner of shape. Given that the circles have to touch the and Gonzalez [17] published a method called Shape outline of the shape at at least 2 points at all times Peeling to improve the stability of image skeletons. and no circle is completely contained in another, the Abiva and Larsson [1] proposed a method to utilise silhouette of all the circles combined describes the the Scale Axis Transform to prune the MA of spuri- shape that is to be produced [3]. Noise on the bound- ous branches. Montero and Lang [12] published an ary of the object can cause spurious branches, mean- algorithm for skeleton pruning by means of contour ing branches of the MA that do not hold valuable in- approximation and the integer MAT in 2012. formation about the appearance of the shape. Noise within the object may cause holes and therefore cir- 3.2. Decomposition cular MAs. In Fig.7, we compose several branches to Decomposition is performed in branching areas to obtain less complex axes. Serino, Arcelli and Sanniti di Baja [15] recently described the decomposition of 3D objects at branching points to obtain meaningful object parts. In 2D, the branching area lies within the largest inscribing circle where 2 or more branches of the axis meet in the centre (Fig. 9). While the Figure 7. Circles swept along a composed 1D MA. Trans- points of the shape lying in a circle that only belongs parency indicates the sweeping movement. to one axis, are uniquely defined, points within the branching area can be described in relation to sev- one MA. The constellation of branches determines eral branches of the axis (Fig. 10). If branches move the structure of the object. The structure can have in respect to each other, these points shall each be different constraints in its movement, depending on affiliated with only one branch to preserve a unique the intrinsic mobility of the object. This topic is dis- representation. While Serino, Arcelli and Sanniti di cussed further in chapter 3.3 Preservation of struc- Baja [15] can already demonstrate impressive experi- ture. When creating the 2D object, the Euclidean co- mental results of the decomposition of the composed Figure 9. A point within a branching area can be described in relation to several branches of the axis. Figure 12. A sphere swept along the branching curve cre- ating a new 3D object based on a sphere. Figure 10. A point within a branching area can be de- The branching area itself can be seen as a 3D rod- scribed in relation to several branches of the axis. Axis like object or as a 4D object created by sweeping a a is extended across the centre, illustrating its negative do- 3D sphere along an axis. This implies a leap of at main. least 2 dimensions to reach the 1D MA, which vi- olates the assumption that the MAT reduces the di- 1D MA of 3D objects, MAs of higher dimensions mensionality of an object by 1. A problem that is yet require further research. 
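A minimal sketch of the longitude/latitude description, assuming the decomposed MA branch is available as a 2D polyline: the longitude is the arc-length position of the closest point on the branch and the latitude is the signed offset along the local normal. The function name and the toy branch are illustrative, and the treatment of branching areas discussed above is ignored here.

```python
import numpy as np

def branch_coordinates(branch, point):
    """Longitude/latitude of a 2D point with respect to a polyline MA branch:
    longitude is the geodesic (arc-length) position of the closest point on
    the branch, latitude the signed distance along the branch normal there."""
    branch = np.asarray(branch, dtype=float)
    point = np.asarray(point, dtype=float)

    best = (np.inf, 0.0, 0.0)                      # (|lat|, longitude, latitude)
    arclen = 0.0
    for p, q in zip(branch[:-1], branch[1:]):
        seg = q - p
        seg_len = np.linalg.norm(seg)
        t = np.clip(np.dot(point - p, seg) / seg_len**2, 0.0, 1.0)
        foot = p + t * seg                         # closest point on this segment
        offset = point - foot
        dist = np.linalg.norm(offset)
        if dist < best[0]:
            normal = np.array([-seg[1], seg[0]]) / seg_len
            best = (dist, arclen + t * seg_len, float(np.dot(offset, normal)))
        arclen += seg_len
    return best[1], best[2]                        # (longitude, latitude)

# a gently curved branch and a point lying to its left
branch = [(0, 0), (1, 0.2), (2, 0.3), (3, 0.3)]
print(branch_coordinates(branch, (1.5, 1.0)))
```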
to be solved and is explained further in the chapter In 3D, there can be branching points or branching 3.4.1 Spheres and objects based on spheres. curves where the branches of the MA meet as can A different approach is to apply the MAT recur- be seen in Fig. 11. In a first idea we approach the sively to every branch of the MA until 1D is reached. branching area as if it is an object itself. The branch- This way, joints will not necessarily imply a con- ing area of a curve we define by the largest inscribing nection of the MA branches (Fig. 13) and the MA sphere that is swept along the branching curve (Fig. branches of an object might not intersect. If the MA 12). breaks into several pieces, it arises the question of how the structure can be maintained. Further work on this matter is required. Figure 11. Two 2D MA branches of a 3D object forming Figure 13. 1D MA branches of the 2D MA branches do a branching curve where they intersect. not intersect. 3.3. Preservation of structure a b Articulated objects with a specific range of motion require constraints at joints, so the human forearm can not rotate around the elbow, but can only flex in one direction to a certain degree. Non-rigid objects, like cloth, require different constraints since they do not have joints, but feature a certain thickness, stiff- c d ness, weight and other properties. A basic ordering has to be maintained regardless of these characteris- tics. As shown in Fig. 14, all MA branches might be Figure 15. a) 3D sphere producing a 0D MA. b) Equator applied to a sphere to provide orientation for the spherical coordinate system. c,d) Shape described by sweeping a spherical coordinate system along a path. Figure 14. 3D branching point of MA branches. Branch a ates a 0D MA. Fig. 15.b illustrates the sphere after can move freely except across the triangles spanned by the other branches b, c and d. application of an equator to orient the spherical co- ordinate system. With these systems, all points of a able to move freely, provided they do not cross planes sphere can be distinctly determined. Objects based spanned by two different axes to sustain the objects on spheres imply that the shape can be created by organisation. The structure can be preserved by con- moving a sphere along a path (Fig. 15.c, Fig. 15.d). sidering the branches of the MA as edges and the end It follows, therefore, that every point of the object points and branching points as nodes of a combina- based on a sphere can be uniquely determined when torial map as described by Damiand and Lienhardt the spherical coordinate system is moved along the [4]. MA. 3.4. Special shapes 3.4.2 Circular medial axes There are several open problems regarding special shapes in the method that require further research. Circular medial axes occur when an object element Thoughts of the community on the matter are highly has genus higher than 0 (Fig. 16.a) and at concavities appreciated. of the object (Fig. 16.b). If a circular MA branch is connected to 1 or more other branches of the MA, the branching points can be used to decompose the cir- 3.4.1 Spheres and objects based on spheres cular MA and therefore create non-circular sections The concept of MAT is mostly built on the usage of that can be treated regularly. This is the case if the circles and spheres. If an object, or a part of it, itself object features a tail. 
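For the sphere-based special case, a local spherical coordinate system only becomes unique once a pole (and hence an equator) and a prime-meridian direction are fixed, as described above. The following sketch converts points to such coordinates relative to a given centre; the function name and the arbitrarily chosen pole and reference direction are assumptions made purely for illustration.

```python
import numpy as np

def spherical_coords(points, center, pole, prime_dir):
    """Express points in a local spherical coordinate system: radius, elevation
    measured from the chosen equator, and azimuth measured from a reference
    (prime-meridian) direction; the pole/equator choice orients the system."""
    pole = pole / np.linalg.norm(pole)
    prime = prime_dir - np.dot(prime_dir, pole) * pole     # project into the equator plane
    prime /= np.linalg.norm(prime)
    east = np.cross(pole, prime)

    v = np.atleast_2d(points) - center
    r = np.linalg.norm(v, axis=1)
    elevation = np.arcsin(np.clip(v @ pole / r, -1.0, 1.0))
    azimuth = np.arctan2(v @ east, v @ prime)
    return r, azimuth, elevation

r, az, el = spherical_coords([(1.0, 1.0, 1.0)], center=np.zeros(3),
                             pole=np.array([0.0, 0.0, 1.0]),
                             prime_dir=np.array([1.0, 0.0, 0.0]))
print(r, np.rad2deg(az), np.rad2deg(el))
```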
Elements with genus higher is one of these primitives or based on the primitive in than 1 also feature connected MA branches because a higher dimension, the MAT will not create an ob- of the bridge between the holes whose MA branch ject of its dimension minus 1, but it may create a MA connects the sides. This leaves an issue for objects with a dimensionality even lower. This violates our with genus 1 and no tail (Fig. 16.a) and objects with basic assumption that this is the case. This means convex elements (Fig. 16.b). The n-dimensional that the MA can not be used to determine the loca- MA is not confined by a (n-1)-dimensional object, tion of points of the shape uniquely. One approach which violates one of the basic assumptions of this to solve this problem is to utilise spherical coordi- method, namely the concept of cellular complexes. If nate systems. Fig. 15.a shows a 3D sphere that cre- an object produces a circular MA without connected a b for coordinate-systems. The 1D elements as edges and their end points as nodes, form a graph that rep- resents the object. Articulated, as well as non-rigid objects can be described by defining corresponding coordinate systems of each element. This should al- low complex composite transformations of the ob- ject. Intrinsic movement does not imply the transfor- mation of point-clouds or meshes, but of linked local Figure 16. a) 2D circular MA within a tube-like object with an arbitrarily set reference (white). b) 2D circular coordinate systems. MA branch as part of an object’s MA with an arbitrarily Further work to be done on the project is to pro- set reference (white). vide a proof of concept, especially concerning the feasibility of the method for n-dimensions and res- branches, there is no reference point that can be used olution of the open problems described in this paper. as the origin of the coordinate system. A first at- tempt to solve this problem, based on the findings of Acknowledgements Illetschko [8], is to place an arbitrary reference point. We would like to thank the reviewers for construc- This point can be used as the origin of the coordinate tive feedback and the PRIP Club, the organization system based on the MA. Depending on the dimen- of friends and promoters of Pattern Recognition and sionality of the object, also a cut can be necessary. Image Processing activities Vienna, Austria, for sup- Points within the area of the new origin can then be port. defined in relation to both end points of the MA. References A special case is shown in Fig. 17. The torus is a shape based on a sphere, meaning that it can be [1] J. Abiva and L. J. Larsson. Towards automated fil- described as a sphere moved along a circular path. As tering of the medial axis using the scale axis trans- explained earlier, this enforces the use of a spherical form. In Research in Shape Modeling, pages 115– coordinate system. Also the torus has a circular MA, 127. Springer, 2015. 4 which requires an arbitrarily set reference point. [2] O. Aichholzer, W. Aigner, F. Aurenhammer, and B. Juettler. Exact medial axis computation for trian- gulated solids with respect to piecewise linear met- rics. In J. Boissonnat, P. Chenin, A. Cohen, C. Gout, T. Lyche, M. Mazure, and L. Schumaker, editors, Curves and Surfaces, volume 6920 of Lecture Notes in Computer Science, pages 1–27. Springer Berlin Heidelberg, 2012. 2 [3] H. Blum. A Transformation for extracting new de- scriptors of shape. MIT Press, 1967. 4 Figure 17. Special case: Torus is a shape based on a sphere [4] G. 
Damiand and P. Lienhardt. Combinatorial Maps: and creates a circular MA. From arbitrarily set reference Efficient Data Structures for Computer Graphics point on the MA (white), a spherical coordinate system is and Image Processing, volume 129. A. K. Peters, swept along the MA. Ltd. Natick, MA, USA, 2014. 6 [5] P. Felzenszwalb, D. Mc Allester, and D. Ramanan. 4. Conclusion A discriminatively trained, multiscale, deformable part model. Computer Vision and Pattern Recogni- This paper proposes an novel concept for the tion, 2008. CVPR 2008. IEEE Conference on. IEEE, representation of n-dimensional shapes through a 2008. 2 model, based on linked local coordinate-systems. [6] Y. Gao, H. Chang, and Y. Demiris. User modelling Through recursive application of the MAT and for personalised dressing assistance by humanoid decomposition of the resulting MA, some n- robots. In Intelligent Robots and Systems (IROS), dimensional objects can be reduced to multiple 1- 2015 IEEE/RSJ International Conference on, pages dimensional sub-elements that are used as the axis 1840–1845, Sept 2015. 2 [7] M. Godec, P. Roth, and H. Bischof. Hough-based [20] Y. Zhu, F. Sun, Y. Choi, B. Juettler, and W. Wang. tracking of non-rigid objects, volume 117. Elsevier, Computing a compact spline representation of the 2011. 2 medial axis transform of a 2D shape, volume 76. El- sevier, 2014. 2 [8] T. Illetschko. Minimal combinatorial maps for an- alyzing 3D data. Diploma Thesis, TU Wien, 2006. 7 [9] V. A. Kovalevsky. Finite Topology as Applied to Im- age Analysis, volume 2. Academic Press, 1989. 3 [10] P. Li, B. Wang, F. Sun, X. Guo, C. Zhang, and W. Wang. Q-mat: Computing medial axis trans- form by quadratic error minimization. ACM Trans. Graph., 35(1):8:1–8:16, December 2015. 2 [11] Y. Li, C. Chen, and P. Allen. Recognition of de- formable object category and pose. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014. 2 [12] A. Montero and J. Lang. Skeleton pruning by con- tour approximation and the integer medial axis trans- form. Computers & Graphics, 36(5):477–487, 2012. 4 [13] A. Pouch, S. Tian, M. Takabe, H. Wang, J. Yuan, A. Cheung, B. Jackson, J. Gorman, R. Gorman, and P. Yushkevich. Segmentation of the aortic valve apparatus in 3d echocardiographic images: De- formable modeling of a branching medial structure. In Statistical Atlases and Computational Models of the Heart - Imaging and Modelling Challenges, vol- ume 8896 of Lecture Notes in Computer Science, pages 196–203. Springer International Publishing, 2015. 2 [14] A. Rosenfeld. Axial representations of shape, vol- ume 33. Academic Press Professional, 1986. 2 [15] L. Serino, C. Arcelli, and G. Sanniti di Baja. From skeleton branches to object parts, volume 129. El- sevier, 2014. 4 [16] E. Sherbrooke, N. Patrikalakis, and E. Brisson. An Algorithm for the Medial Axis Transform of 3D Poly- hedral Solids, volume 2. IEEE Educational Activi- ties Department Piscataway, 1996. 2 [17] M. Spitzner and R. Gonzalez. Shape peeling for improved image skeleton stability. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 1508–1512, 2015. 4 [18] H. Zhu, Y. Liu, J. Bai, and X. Ye. Construc- tive generation of the medial axis for solid models. Computer-Aided Design, 62:98 – 111, 2015. 2 [19] H. Zhu, Y. Liu, J. Zhao, and H. Wang. Calculat- ing the medial axis of a {CAD} model by multi-cpu based parallel computation. 
Advances in Engineer- ing Software, 85:96 – 107, 2015. 2 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A Computer Vision System for Chess Game Tracking Can Koray Emre Sümer Department of Computer Engineering Department of Computer Engineering Bas¸kent University Bas¸kent University Ankara, TURKEY Ankara, TURKEY cannkorayy@gmail.com esumer@baskent.edu.tr Abstract. In this paper, we present a real-time sys- track by the system. tem that allows the detection of the moves of a chess On the other hand, in a study conducted by Bennet game. In the proposed approach, each captured and Lasenby [2], the recognition of chessboards un- video frame, from a RGB webcam positioned over the der deformation was carried out. Their method deter- chessboard, is processed through the following steps; mined a grid structure to detected vertices of a chess- the detection of the corner points of the chessboard board projection. Further, the same authors devel- grids, geometric rectification, chessboard position oped a feature detector named ‘Chess-board Extrac- adjustment, automatic camera exposure adjustment, tion by Subtraction and Summation (ChESS)’ to re- intensity adjustment, move detection and chessboard spond to chessboard vertices [3]. In a different study, drawing. All steps were implemented in MATLAB a chessboard recognition system was proposed [8]. programming environment without using any chess The proposed system was applied to chessboard in engine. The proposed approach correctly identified order to identify the name, location and the color of 162 of 164 moves in 3 games played under different the pieces. Piskorec et al. [10] presented a com- illumination conditions. puter vision system for chess game reconstruction. The system reconstructs a chessboard state based on video sequences obtained from two cameras. 1. Introduction The tracking of the chess moves can be regarded There are many systems of computer vision, as the preliminary task before designing a robotic which require algorithms to be able to recognize chess playing system. In the literature, there are different objects and scenes. Since, chess game several efforts that perform the chess move tracking. has become an interesting issue in terms of human- The studies conducted by Matuszek et al. [9], Urting computer interaction systems, a computer vision sys- and Berbers [12], Cour et al. [4] and Gonc¸vales et al. tem is needed for chess playing and chessboard [7] use unique algorithms to identify the chessboard recognition system. grids along with the classification of squares. These There are various published techniques related to methods are not only based on corner detection but chess-playing systems. Sokic and Ahic-Djokic [11] also rely on having a clean background. proposed a computer vision system for chess playing In this paper, we propose a real-time chess game robot manipulator as a project-based learning sys- tracking system using a RGB webcam positioned tem. The proposed algorithm detects chess moves by over the chessboard. In general, the move is detected comparing frames captured before, during and after by comparing the occupancy grids based on average a move, and finds the difference between them. In a color information of the pieces and the squares. Be- similar study, Atas¸ et al. 
[1] developed a chess play- fore that, several pre-processing steps are employed ing robotic arm system composed of various modules including geometric rectification, intensity adjust- such as main controller, image processing, machine ment and chessboard position adjustment. The sys- learning, game engine and motion engine of robot tem also works successfully under different illumina- arm. In their study, the top of the pieces are uniquely tion conditions by means of automatic camera expo- designed to be different from each other in order to sure adjustment. Besides been a tracking system, the proposed system can also perform 2D reconstruction of the chessboard states and generate movement logs. 2. Equipment and Setup In this work, a setup is prepared to detect the moves of the pieces during the game. The setup has the Logitech c310 webcam for the capturing footage. The camera which has 5 megapixels resolution is ca- pable of HD 720p recording. The camera has no aut- ofocus functionality. Only the exposure mode from the camera settings is changed to ‘manual mode’ for move detection process. The webcam was used on a mid-range notebook. The chessboard and pieces are selected to meet World Chess Federation (FIDE) re- quirements in terms of color and size [6]. The board and the pieces have different colors from each other. The colors of the pieces are black and white, while the board has dark and light brown colored squares. The camera is positioned over the chessboard by a long and flexible holder as shown in Figure 1. Figure 2. The overall framework Figure 1. The image of the setup located, the saturation value of the captured image is increased gradually as a pre-process step. Once all grid corners are located, the second step is to locate 3. The Overall Framework the chessboard corners (point-C in Figure 3(c)). The grid corner points which are closest to the corners The general block diagram of the proposed sys- of the image are selected as pivot points. Point-A in tem is given in Figure 2. The details of the steps of Figure 3(c) is one of the pivot points. The diagonal the proposed framework are given in the further sub- closest inner point to the point-A is point-B, which is sections. shown in Figure 3(c). The reflection of the point-B 3.1. Chessboard Grid Corner Detection over the point-A is the point-C, which is the one of the chessboard corners as shown in Figure 3(c). This In this process, the first step is to find all grid cor- procedure is applied for all remaining corner points. ner points of the chessboard (Figure 3(a)) by using the snapshot of the camera. To find grid corners (Fig- 3.2. Geometric Rectification ure 3(b)), we used detectCheckerboardPoints func- tion of MATLAB. The function that is particularly The geometric rectification is an essential step to used in camera calibration gets an RGB image as an isolate the chessboard from the environment and cor- input and returns the located grid corners and the size rect the perspective distortion of the chessboard to of the board as an output. Until all grid corners are pave the way for the other processes. The chessboard is warped from its corner points which are located in the previous section to coincide with our predeter- mined size square corners (480x480px) (Figure 4). (a) Figure 4. The chessboard before geometric rectification step This process is applied only once before the game starts therefore, either the camera or the board should not be moved during the game. 
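The rectification step can be sketched with standard OpenCV calls, as an assumed Python analogue of the MATLAB implementation: the four detected board corners are mapped onto a fixed 480x480 square by a homography that is then reused for every later snapshot. The corner ordering and the function name are illustrative.

```python
import cv2
import numpy as np

def rectify_board(frame, board_corners, size=480):
    """Sketch of the geometric rectification step: warp the four detected
    chessboard corners onto a fixed 480x480 square so that later per-square
    processing can work on an axis-aligned board."""
    src = np.asarray(board_corners, dtype=np.float32)        # TL, TR, BR, BL order assumed
    dst = np.float32([[0, 0], [size, 0], [size, size], [0, size]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame, H, (size, size)), H

# The same homography H can be reused for every later snapshot, since the
# rectification is computed once and camera and board must stay fixed.
```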
The geometrically corrected chessboard is presented in Figure 5. (b) Figure 5. The chessboard after geometric rectification step 3.3. Chessboard Position Adjustment (c) Figure 3. (a) The original chessboard image, (b) the de- To ease the calculations of the future processes, tected chessboard grid corners and (c) the related points the white pieces are needed to be positioned at the to chessboard grid corner detection bottom of the view. Thanks to camera position, we know that the positions of the pieces have to be on the left and right side of the camera. The comparison of the average colors of the both side’s king square gives us the position of the white pieces. According to the white side position, a new transformation matrix is computed to be used in the future warping processes. In Figure 6, the white pieces are located at the bottom while the black ones are at the top. Figure 6. The chessboard after chessboard position adjust- ment Figure 7. The pseudo-code of the automatic camera expo- sure adjustment 3.4. Automatic Camera Exposure Adjustment The built-in automatic exposure mode of the cam- era may cause undesirable image acquisition for the move detection. In this mode, the camera continu- ously adjusts the exposure level according to the cap- tured footage. Especially, whenever the player makes a move, the camera changes its exposure level due to the player hand on the captured image. In addition, the exposure level which is adjusted by built-in au- tomatic exposure mode of the camera can be under or overexposure. In order to find optimum exposure level of the camera, it needs to be adjusted manually at the beginning of the game. The aim of this process is to get correct color values as much as possible by preventing under and overexposure situations. We Figure 8. The chessboard after automatic camera exposure proposed our automatic camera exposure algorithm adjustment that aims to find the optimum exposure level which maximizes the average of the color differences be- 3.5. Intensity Adjustment tween light/dark piece and square (Figure 7). The calculated optimum exposure level is set to camera To improve the image quality, a set of enhance- as a new exposure level for the following processes. ments is applied to the snapshots of the camera. The In the present case the computed exposure level was first one is to reduce the noise problem. We used a computed to be -6 where the full range is between - 5x5 median filter to minimize the noise level of the 9 (the darkest) and 0 (the lightest). The snapshot of images. The second one is to increase the saturation the chessboard after applying the computed exposure of the image to enhance colors. After this process, level is given in Figure 8. the average colors of pieces and squares are calcu- lated to be used in further processes. The image of • Reference color of the light pieces is calculated the chessboard after the intensity adjustment step is by taking the average of the 16 squares that are illustrated in Figure 9. occupied by the light pieces. • Reference color of the dark pieces is calculated by taking the average of the 16 squares that are occupied by the dark pieces. • Reference color of the light squares is calculated by taking the average of the 16 light squares that are not occupied by any pieces. • Reference color of the dark squares is calculated by taking the average of the 16 dark squares that are not occupied by any pieces. 3.7. 
Move Detection The implementation of the move detection is based on a comparison between the reference im- Figure 9. The chessboard after intensity adjustment age and the snapshot of the camera. For this pro- cess, the reference image is used as the first snapshot 3.6. Average Color References which is taken after each valid move. The first ref- After all enhancements, in order to get color val- erence image is regarded as the first snapshot of the ues of each square of the chessboard, the image of footage. During the game, the average color differ- the chessboard (Figure 9) divided into 64 identical ence is calculated between the reference image and pieces each in correspondence to a square of the snapshots. Whenever the result of the calculation ex- board. Therefore occupancy grids are created for the ceeds a predetermined threshold, we conclude that chessboard. After that, it is defined a region of in- the player makes a move. After the result goes down terest (ROI) for each square (grid) of the chessboard. below the threshold, we assume that the player fin- The primary aim of using ROIs is to get color infor- ished the move. mation of the piece. ROI is defined as a 25x25px At this point, the last snapshot is interpreted to de- rectangle from the center of each square as shown in termine the color and position of the pieces. Before Figure 10. this process, the last snapshot is warped and the en- hancements are applied to the warped image of the chessboard. The ROI within each square of the im- age is compared with the four reference colors which are determined in section 3.6. In this comparison, the color differences are calculated in Lab color space by computing the deltaE value that represents the Eu- clidean distance of the related items. As a result of the comparisons, the reference color that gives the minimum deltaE value determines if a grid cell is a Figure 10. The orange colored region of interest superim- square or a piece with light or dark color. By ap- posed on the pawn plying this process to all squares of the chessboard, the chessboard state of the last snapshot is revealed. At the beginning of the game, before the move The state of the last snapshot and the previous chess- detection, the average color values of the light/dark board state are compared to detect the move of the pieces and squares are received and recorded as ref- piece. The previous chessboard state represents the erence values. chessboard state of the last valid move. At the begin- The reference colors of each type of piece and ning of the game, the first state of the game is stored square calculated as follows: as the previous chessboard state. When the state of the snapshot and the previ- ous chessboard state are compared, six different out- comes can be obtained: 1. If there is no difference between previous and last states, this means there is no change in the game. For this reason, the color difference over the board is not a move. 2. If there are only one occupied and only one un- occupied squares difference with the same piece color then this is a move. 3. If there are two occupied and two unoccupied squares difference with the same piece color, then this is a special move called ’castling’. 4. If there are one occupied and one unoccupied Figure 11. 
The reconstructed chessboard state with move squares difference with the same piece color and list one unoccupied square difference with the other piece color, then this is another special move The saturation enhancement which is applied to called ’en passant’. the images taken from capturing footage helped to in- crease the accuracy of the average color differences. 5. If there is only one unoccupied square differ- The combination of the lighting, camera settings ence and if there is a piece color change to the and chess set are playing a big role in the success previous piece color of the unoccupied cell in of detecting moves in a chess game. Although the any other occupied square, then this is a captur- proposed system works well under different illumi- ing move. nation conditions, lighting environments (having a single light source) that cast strong shadows over the 6. For all other conditions, the result of the com- board are unsuitable for tracking. parison is not a move. On the other hand, shadows over the light pieces If the result is a move then the state of the chess- are another important problem. This makes difficult board is updated as the last chessboard state. The to separate the light pieces from the light squares, as move is added to the move list and the last state of in 2 of 164 undetected moves during the experimen- the chessboard is reconstructed in 2D. An example tal evaluation. In addition, this problem may cause to 2D state reconstructed from a test game and the move get incorrect results from the automatic camera expo- list are presented in Figure 11. The moves are logged sure adjustment method. as standard algebraic notation which is the notation Shadows and specular reflections over a particu- standardized by World Chess Federation (FIDE) [5]. lar area of the chessboard can break the uniformity Note that all the steps of the proposed methodology of the colors. In these conditions, a chess game can- including the graphical user interface were imple- not be tracked by the proposed system. Besides, due mented in MATLAB. to the reference colors of pieces and squares are de- termined at the beginning of the game, the overall 4. Experimental Evaluation and Discussion illumination of the environment should not change dramatically during the game. Otherwise, the move In order to test the system, three chess games are detection cannot be possible by the system. played at different times having different illumina- tion conditions. In these tests, 162 moves of all 164 5. Conclusion moves are successfully detected by the system. The corner points of the chessboard are successfully lo- In this paper, we have presented a real-time sys- cated in all games. The system performance was tem that performs the detection of the chess moves. found to be satisfactory to detect moves in real-time. The preprocessing steps are found to be quite useful. In particular, automatic camera exposure adjustment tion. Proceedings of the 34th International Conven- highly reduces the color ambiguities. The environ- tion, pages 870–876, 2011. 1 ments which are heavily under the influence of direc- [11] E. Sokic and M. Ahic-Dokic. Simple computer vi- tional lights are not recommended because of casting sion system for chess playing robot manipulator as strong shadows. The results of the played games in- a project-based learning example. 
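The colour comparison behind the move interpretation can be sketched as follows, assuming scikit-image for the Lab conversion and the CIE76 deltaE: the mean colour of a square's ROI is matched against the four reference colours and the closest one labels the square. The reference RGB values below are made up; in the system they are measured from the first frame as described in Section 3.6.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_cie76

def classify_square(roi_rgb, references):
    """Compare the mean colour of a square's ROI against the four reference
    colours (light/dark piece, light/dark empty square) using the Euclidean
    distance in Lab (deltaE) and return the label of the closest one."""
    roi_lab = rgb2lab(roi_rgb.astype(np.float64) / 255.0)
    mean_lab = roi_lab.reshape(-1, 3).mean(axis=0)

    best_label, best_de = None, np.inf
    for label, ref_rgb in references.items():
        ref_lab = rgb2lab(np.asarray(ref_rgb, dtype=np.float64).reshape(1, 1, 3) / 255.0)[0, 0]
        de = deltaE_cie76(mean_lab, ref_lab)
        if de < best_de:
            best_label, best_de = label, de
    return best_label

# illustrative reference colours (RGB); in practice measured from the first frame
references = {
    "light_piece": (230, 225, 210), "dark_piece": (40, 35, 30),
    "light_square": (200, 170, 130), "dark_square": (120, 80, 50),
}
roi = np.full((25, 25, 3), 228, dtype=np.uint8)      # a 25x25 px ROI over a light piece
print(classify_square(roi, references))              # -> "light_piece"
```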
Proceedings of the IEEE International Symposium on Signal Pro- dicate that the proposed system can be an affordable cessing and Information Technology, pages 75–79, and efficient option among chess game tracking sys- 2008. 1 tems. [12] D. Urting and Y. Berbers. Marineblue: A low- As an addition to the current system, a chess move cost chess robot. Proceedings of the International validation system is under progress to interpret the Conference Robotics and Applications, pages 76– player moves. By this way, the system not only 81, 2003. 1 tracks the position of the pieces but also validates the movements according to the type of the piece. Therefore, the future system can be used to help de- cision making and monitoring by referees and anti- cheat committee. References [1] M. Atas¸, Y. Do˘gan, and ˙I. Atas¸. Chess playing robotic arm. Proceedings of the IEEE 22nd Signal Processing and Communications Applications Con- ference, pages 1171–1174, 2014. 1 [2] S. Bennet and J. Lasenby. Robust recognition of chess-boards under deformation. Proceedings of the 20th IEEE International Conference on Image Pro- cessing, pages 2650–2654, 2013. 1 [3] S. Bennet and J. Lasenby. Chess – quick and robust detection of chess-board features. Computer Vision and Image Understanding, 118:197–210, 2014. 1 [4] T. Cour, R. Lauranson, and M. Vachette. Au- tonomous chess-playing robot, 2006. 1 [5] FIDE. Handbook, 2015. Laws Of Chess. 6 [6] FIDE. Handbook, 2015. Standards of Chess Equip- ment and Tournament Venue. 2 [7] J. Gonc¸alves, J. Lima, and P. Leitao. Chess robot system: A multi-disciplinary experience in automa- tion. Proceedings of the 9th Spanish-Portuguese Congress on Electrical Engineering, 2005. 1 [8] I. M. Khater, A. S. Ghorab, and I. A. Aljar- rah. Chessboard recognition system using signature, principle component analysis and color information. Proceedings of the Second International Conference on Digital Information Processing and Communica- tions, pages 141–145, 2012. 1 [9] C. Matuszek, B. Mayton, R. Aimi, M. P. Deisenroth, L. Bo, R. Chu, M. Kung, L. LeGrand, J. R. Smith, and D. Fox. Gambit: A robust chess-playing robotic system. Proceedings of the IEEE International Con- ference on Robotics and Automation, pages 4291– 4297, 2011. 1 [10] M. Piskorec, N. Antulov-Fantulin, J. Curic, O. Dragoljevic, V. Ivanac, and L. Karlovic. Com- puter vision system for the chess game reconstruc- 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 Fast L1-based RANSAC for homography estimation Jonáš Šer´ych, Jiř´ı Matas, Ondřej Drbohlav Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Center for Machine Perception, Technická 2, 166 27 Praha 6, Czech Republic {serycjon,matas,drbohlav}@fel.cvut.cz Abstract. We revisit the problem of local optimiza- sistent with all correct correspondences. tion (LO) in RANSAC for homography estimation. The problem was first addressed in a paper by The standard state-of-the-art LO-RANSAC improves Chum et al [2] who proposed an additional RANSAC the plain version’s accuracy and stability, but it may step, the so called local optimization (LO). The LO be computationally demanding, it is complex to im- step is employed whenever a new candidate model plement and requires setting multiple parameters. M is the best one so far found in the RANSAC loop, We show that employing L1 minimization instead of i.e. 
the standard LO step of LO-RANSAC leads to results with similar precision. At the same time, the proposed L1 minimization is significantly faster than the standard LO step of [8], it is easy to implement, and it has only a few parameters, all of which have an intuitive interpretation. On the negative side, the L1 minimization does not achieve the robustness of the standard LO step; its probability of failure is higher.

1. Introduction

RANSAC [3] is a robust model fitting algorithm that is the standard method used for two-view geometry estimation [5]. The plain version of RANSAC proceeds as follows: (i) randomly sample the minimum number of points required to calculate the model parameters, (ii) compute the cardinality of the set consistent with that model, i.e. the number of inliers, and (iii) terminate if the probability that a better model than the best one found so far will be found falls under a predefined threshold. The precision of the model returned by the algorithm is typically improved by least square fitting of the inliers of the best model.

It has been observed [11] that the termination criterion (iii) stops the process later than expected given the recovered percentage of inliers. The discrepancy is due to a generally incorrect (overoptimistic) assumption that every minimal sample of inliers generates a "good" model, i.e. a model that will be consistent with all correct correspondences.

The LO step is invoked whenever a new candidate model M is the best one found so far in the RANSAC loop, i.e. it has more inliers than any of the models estimated from the random minimal samples evaluated so far. Chum et al. [2] proved that with this strategy, the LO step is run only log(k) times, where k is the number of random models tested.

The local optimization step [2] performs various heuristic procedures with the objective of increasing the accuracy of M, such as generating hypotheses by resampling the inliers of M and performing iterative least square estimation combined with scheduled inlier threshold changes. The standard implementation of RANSAC with the local optimization step, found in the commonly used publicly available code [10], is a combination of the above-mentioned heuristic procedures.

The choice and parameter settings of local optimization methods influence the speed and accuracy of the algorithm. In the state-of-the-art version [8], the LO step executes a complex procedure which involves repeated sampling from the inliers of M and repeated iterative least squares minimisation. As sampling is involved, it is stochastic¹. Due to both the repeated sampling and the iterative least squares, it is so computationally demanding, in comparison with RANSAC steps (i) and (ii), that despite being executed only rarely, the LO step significantly influences the overall running time.

In this paper, we propose to replace the complex LO procedure of Lebeda et al. by minimization of the sum of L1 norms of the residuals, i.e. the algebraic errors of the model on the individual points. The minimizer of the L1 norm, also known as the geometric median, is robust to a modest contamination by outliers. This means that RANSAC becomes less sensitive to the inlier-outlier threshold. The threshold, a critical parameter of RANSAC, can therefore be set more loosely and thus cover a wide range of problems. Moreover, the L1-based procedure need not include least square estimation with multiple thresholds, thus saving time.

In practice we replace the L1 norm by the response of the Huber robust kernel to the inlier algebraic errors. The Huber cost function is defined in Eq. 6. The Huber kernel is differentiable and convex, and the global minimum of the cost function can be found by gradient descent. The gradient minimization alternates with the inlier-outlier selection process. The alternating minimisation can be seen as local optimization of the truncated Huber kernel. The procedure has only a small number of parameters with intuitive meaning, it is simple, and it is deterministic.

We show that the minimization produces errors which are comparable to the standard LO-RANSAC, while being computationally much less expensive, by an order of magnitude in our experiments compared to the standard local optimization.

¹Since the outer loop of RANSAC is stochastic, the inner sampling does not change the character of the algorithm.

1: procedure STANDARD LO
2:   Input: M (model estimated by LSq), I (inliers)
3:   for r = 1 → reps do
4:     sample S drawn from inliers
5:     model M is estimated from S
6:     iterative least squares on M
7:   end for
8:   return best model
9: end procedure

1: procedure L1-BASED LO
2:   Input: M (minimum sample model), I (inliers)
3:   while stopping condition not met do
4:     M ← model estimated from inliers by IRLS optimization
5:     I ← inliers to M
6:   end while
7:   return M
8: end procedure

Table 1: Comparison of the standard and the proposed local optimization procedures in RANSAC. IRLS stands for iterated re-weighted least squares. Note that the standard LO method includes several rounds of IRLS, which are themselves computationally demanding (for details, see [8]). The iteration stops if either the change in the cost function is below 10⁻³ or the maximum number of iterations is reached (set to 5).
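As a rough sketch of where the LO step sits in the outer loop, the following Python skeleton runs plain RANSAC for a homography and calls a pluggable local optimization only when a hypothesis becomes the best so far. The callables `fit_minimal`, `residuals` and `local_opt` are placeholders to be supplied (for example a 4-point DLT, a reprojection error, and either LO variant of Table 1), and the adaptive termination criterion (iii) is simplified here to a fixed iteration cap.

```python
import numpy as np

def ransac_with_lo(pts1, pts2, fit_minimal, residuals, local_opt,
                   thresh=3.0, max_iter=2000, rng=None):
    """Skeleton of LO-style RANSAC: the (possibly expensive) local optimisation
    is only invoked when a minimal-sample hypothesis becomes the best so far,
    which happens roughly log(k) times over k tested models."""
    rng = rng or np.random.default_rng()
    pts1, pts2 = np.asarray(pts1), np.asarray(pts2)
    n = len(pts1)
    best_model, best_inliers = None, np.zeros(n, dtype=bool)

    for _ in range(max_iter):
        sample = rng.choice(n, size=4, replace=False)       # 4 points for a homography
        model = fit_minimal(pts1[sample], pts2[sample])
        if model is None:
            continue
        inliers = residuals(model, pts1, pts2) < thresh
        if inliers.sum() > best_inliers.sum():
            # new best-so-far hypothesis: run the LO step (standard or L1/IRLS based);
            # local_opt is assumed to return a refined model and its inlier mask
            model, inliers = local_opt(model, pts1, pts2, inliers, thresh)
            if inliers.sum() > best_inliers.sum():
                best_model, best_inliers = model, inliers
    return best_model, best_inliers
```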
2. Method

The difference between the standard and the proposed LO method is presented in Table 1. The L1 minimization is carried out by iterated reweighted least squares (IRLS). The particular instantiation of IRLS is known as the generalized Weiszfeld algorithm [1]. Weiszfeld proved that geometric median minimization by IRLS requires solving repeated least-squares problems in which each data point is weighted by the reciprocal of its residual with respect to the current estimate of the model. The algorithm has to be modified to avoid singularities when some point is exactly consistent with the model, i.e. when it has a zero residual. To avoid the problem, we replace the L1 minimization with Huber kernel minimization. In the implementation, this only means that points with small residuals are not scaled.

First, the necessary notation is introduced. The $L_2^2$ norm (for a vector $r \in \mathbb{R}^D$) is defined as

$$\|r\|_2^2 = \sum_{j=1}^{D} |r_j|^2, \qquad (1)$$

and the $L_2^1$ norm (for $r \in \mathbb{R}^D$) as

$$\|r\|_2^1 = \sqrt{\sum_{j=1}^{D} |r_j|^2}. \qquad (2)$$

2.1. Homography estimation by algebraic error minimization in the $L_2^2$ and $L_2^1$ norms

Let the number of correspondences be M. The data matrix Z is computed from the correspondences by a standard procedure [5]: let $(x, y)$ and $(x', y')$ be a correspondence pair. It generates two rows of the data matrix Z:

$$\begin{pmatrix} x & y & 1 & 0 & 0 & 0 & -x'x & -x'y & -x' \\ 0 & 0 & 0 & x & y & 1 & -y'x & -y'y & -y' \end{pmatrix}. \qquad (3)$$

Let $z^{(i)}$ denote the two rows generated by the i-th correspondence.
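As an illustration, the data matrix Z of Eq. (3) can be assembled as follows; this is a sketch assuming the correspondences are given as rows (x, y, x', y').

```python
import numpy as np

def data_matrix(corrs):
    """Assemble the 2M x 9 data matrix Z of Eq. (3); every correspondence
    (x, y) <-> (x', y') contributes two consecutive rows."""
    Z = []
    for x, y, xp, yp in corrs:
        Z.append([x, y, 1.0, 0.0, 0.0, 0.0, -xp * x, -xp * y, -xp])
        Z.append([0.0, 0.0, 0.0, x, y, 1.0, -yp * x, -yp * y, -yp])
    return np.asarray(Z)
```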
The homography h is estimated from Z by one of the following optimizations.

The $L_2^2$ optimization

$$\hat{h} = \arg\min_{\hat{h}} \sum_{i=1}^{M} \|z^{(i)} \hat{h}\|_2^2 \quad \text{subj. to } \hat{h}^{\top}\hat{h} = 1 \qquad (4)$$

is solved by computing the spectral decomposition of $Z^{\top}Z$ and taking the eigenvector corresponding to the smallest eigenvalue. The algorithm has the following properties: it is fast, but it is not robust, with a breakdown point of zero [7] – in general, a single outlier can make h arbitrarily wrong².

² In RANSAC, the error on a single point is bounded by the inlier threshold. In practice, points close to the inlier-outlier boundary make the outcome of standard RANSAC unstable.

The $L_2^1$ optimization, defined as

$$\hat{h} = \arg\min_{\hat{h}} \sum_{i=1}^{M} \|z^{(i)} \hat{h}\|_2^1 \quad \text{subj. to } \hat{h}^{\top}\hat{h} = 1, \qquad (5)$$

is robust and can be solved by the generalized Weiszfeld algorithm, an instance of IRLS. Instead of modifying the algorithm to take care of the technical problems associated with the Weiszfeld algorithm, which arise if one of the residuals vanishes, we optimize the response of the Huber kernel instead.

The Huber optimization is defined as

$$\hat{h} = \arg\min_{\hat{h}} \sum_{i=1}^{M} \begin{cases} \tfrac{1}{2}\,\|r^{(i)}\|_2^2 & : \|r^{(i)}\|_2^1 \le k \\ k\,\big(\|r^{(i)}\|_2^1 - \tfrac{k}{2}\big) & : \|r^{(i)}\|_2^1 \ge k \end{cases} \quad \text{subj. to } \hat{h}^{\top}\hat{h} = 1 \text{ and } r^{(i)} = z^{(i)}\hat{h}. \qquad (6)$$

The motivation for using this optimization is its robustness. It is closely related to geometric median computation and the formulation is convex. It is a well-known property of the median that it is robust to outliers for up to 50% contamination of the samples. This property makes the procedure insensitive to the choice of the inlier-outlier threshold of the "outer" RANSAC loop.

The minimization is carried out by a slightly modified Weiszfeld algorithm [12], an iteratively reweighted least-squares method:

1: procedure IRLS OPTIMIZATION
2:   Initialize h as the estimate obtained from the minimal sample
3:   while stopping condition not met do
4:     Compute the error $r^{(i)} = \|z^{(i)} h\|_2^1$ for all $i = 1, 2, \ldots, M$   (7)
5:     Reweight Z: $z^{(i)} \leftarrow \sqrt{w^{(i)}}\, z^{(i)}$ for all $i = 1, 2, \ldots, M$   (8)
6:     Recompute h using the $L_2^2$ optimization (4)
7:   end while
8: end procedure

The iteration stops when

$$\sum_{i} r^{(i)}_{(t)} - \sum_{i} r^{(i)}_{(t+1)} \approx 0,$$

i.e. when the value of the cost function does not change between consecutive iterations, or after 5 iterations have been completed. The second condition reflects the empirical observation that most of the time the IRLS algorithm converges within 3 iterations; it is used only as a guarantee against an infinite loop.

In the case of the $L_2^1$ optimization, the weight $w^{(i)}$ is set to $1 / (\|r^{(i)}\|_2^1 + \delta)$. A small constant δ is used to avoid the problem of dividing by zero when a residual vanishes. The $L_2^1$ optimization proposed above thus introduces an additional parameter δ in order to deal with the division by zero, but its interpretation is not clear. Using the Huber cost function instead of the L1 norm avoids this numerical issue. The weight $w^{(i)}$ is then set as follows [13]:

$$w^{(i)} = \begin{cases} 1 & : \|r^{(i)}\|_2^1 \le k \\ k \,/\, \|r^{(i)}\|_2^1 & : \|r^{(i)}\|_2^1 \ge k \end{cases}$$

The additional Huber parameter k can be intuitively seen as a smoothing factor between the $L_2^2$ and $L_2^1$ norms or, alternatively, as a lower bound on the inlier threshold.
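One possible rendering of the IRLS optimization above is sketched below; it reuses the data_matrix helper from Section 2.1, applies the Huber weights just defined, and solves each reweighted subproblem via Eq. (4). The stopping rule follows the text; the value of the Huber parameter k passed in is only an illustrative assumption.

```python
import numpy as np

def l2_fit(Z):
    """Eq. (4): the unit vector h minimising ||Z h||_2^2 is the eigenvector of
    Z^T Z corresponding to the smallest eigenvalue."""
    _, vecs = np.linalg.eigh(Z.T @ Z)        # eigenvalues returned in ascending order
    return vecs[:, 0]

def irls_fit(corrs, h_init, k=1.0, max_iters=5, tol=1e-3):
    """Huber-weighted IRLS (generalised Weiszfeld) for the homography 9-vector h."""
    Z = data_matrix(corrs)
    h = np.asarray(h_init, dtype=float).ravel()
    h = h / np.linalg.norm(h)
    prev_cost = np.inf
    for _ in range(max_iters):
        # per-correspondence residual magnitudes ||r^(i)||_2^1, Eq. (7)
        r = np.linalg.norm((Z @ h).reshape(-1, 2), axis=1)
        cost = r.sum()
        if abs(prev_cost - cost) < tol:      # cost change below 10^-3
            break
        prev_cost = cost
        # Huber weights, applied to both rows of each z^(i), Eq. (8)
        w = np.where(r <= k, 1.0, k / np.maximum(r, 1e-12))
        Zw = np.repeat(np.sqrt(w), 2)[:, None] * Z
        h = l2_fit(Zw)                       # reweighted L2^2 step, Eq. (4)
    return h
```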
3. Experiments

We compared the standard RANSAC, the state-of-the-art LO-RANSAC and the proposed L1-based RANSAC on a dataset consisting of 42 image pairs, including selected images from the ZuBuD dataset [4], images from Lebeda's homog dataset [9] used for the evaluation of LO-RANSAC, and images from the symbench dataset [6]. The Hessian-Affine feature detector with the SIFT descriptor was used for obtaining the tentative correspondences.

Image | Qty | RANSAC | LO-RANSAC | L1-based RANSAC
05 | I | 953.2 ±0.9 (950-956) | 953.0 ±0.0 (953-953) | 953.0 ±0.1 (952-953)
05 | LO time (µs) | 0.0 ±0.0 (0-0) | 29158.8 ±3383.8 (27499-42497) | 3934.6 ±1035.1 (1901-6479)
05 | I (%) | 76.9 ±0.1 (77-77) | 76.9 ±0.0 (77-77) | 76.9 ±0.0 (77-77)
05 | Samp | 11.8 ±5.7 (7-35) | 11.8 ±5.7 (7-35) | 7.5 ±1.9 (7-19)
05 | Time (ms) | 6.1 | 35.5 | 13.7
05 | Error | 0.74 ±0.05 (0.6-0.9) | 0.72 ±0.00 (0.7-0.7) | 0.73 ±0.01 (0.7-0.8)
05 | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 2.2 ±0.9 (1-5)
adam | I | 250.8 ±1.1 (244-252) | 251.0 ±0.0 (251-251) | 251.0 ±0.0 (251-251)
adam | LO time (µs) | 0.0 ±0.0 (0-0) | 10922.9 ±1797.6 (8737-15292) | 1318.5 ±214.1 (917-1917)
adam | I (%) | 97.6 ±0.4 (95-98) | 97.7 ±0.0 (98-98) | 97.7 ±0.0 (98-98)
adam | Samp | 5.0 ±2.6 (2-14) | 5.0 ±2.6 (2-14) | 2.0 ±0.3 (2-4)
adam | Time (ms) | 2.3 | 14.4 | 4.4
adam | Error | 1.15 ±0.45 (0.4-2.8) | 0.77 ±0.05 (0.6-0.8) | 0.79 ±0.02 (0.6-0.8)
adam | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.4 ±0.5 (1-2)
boat | I | 328.4 ±0.5 (328-330) | 328.0 ±0.2 (328-329) | 328.0 ±0.0 (328-328)
boat | LO time (µs) | 0.0 ±0.0 (0-0) | 13874.9 ±2006.0 (11071-16489) | 1738.3 ±323.7 (917-2428)
boat | I (%) | 86.2 ±0.1 (86-87) | 86.1 ±0.1 (86-86) | 86.1 ±0.0 (86-86)
boat | Samp | 6.2 ±2.5 (4-15) | 6.2 ±2.5 (4-15) | 4.1 ±0.4 (4-7)
boat | Time (ms) | 2.6 | 17.8 | 5.9
boat | Error | 1.30 ±0.14 (1.1-2.1) | 1.23 ±0.01 (1.2-1.2) | 1.24 ±0.00 (1.2-1.2)
boat | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.7 ±0.7 (1-3)
Brussels | I | 450.0 ±3.5 (428-451) | 451.0 ±0.0 (451-451) | 451.0 ±0.0 (451-451)
Brussels | LO time (µs) | 0.0 ±0.0 (0-0) | 16342.6 ±2310.4 (13755-19648) | 2094.9 ±347.1 (1090-3084)
Brussels | I (%) | 87.2 ±0.7 (83-87) | 87.4 ±0.0 (87-87) | 87.4 ±0.0 (87-87)
Brussels | Samp | 8.3 ±4.2 (4-22) | 8.3 ±4.2 (4-22) | 4.1 ±0.3 (4-6)
Brussels | Time (ms) | 3.4 | 20.6 | 6.6
Brussels | Error | 1.39 ±0.37 (1.1-3.3) | 1.24 ±0.00 (1.2-1.2) | 1.24 ±0.00 (1.2-1.3)
Brussels | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.8 ±0.7 (1-3)
graf | I | 840.1 ±9.8 (808-848) | 846.2 ±0.4 (846-847) | 846.0 ±0.0 (846-846)
graf | LO time (µs) | 0.0 ±0.0 (0-0) | 24032.7 ±2219.6 (21845-29919) | 4274.4 ±834.5 (1794-6007)
graf | I (%) | 89.9 ±1.1 (87-91) | 90.6 ±0.0 (91-91) | 90.6 ±0.0 (91-91)
graf | Samp | 7.3 ±3.5 (3-20) | 7.3 ±3.5 (3-20) | 3.2 ±0.7 (3-8)
graf | Time (ms) | 4.8 | 29.5 | 11.9
graf | Error | 1.69 ±0.22 (1.4-2.7) | 1.45 ±0.00 (1.4-1.5) | 1.45 ±0.01 (1.4-1.6)
graf | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.7 ±0.7 (1-4)
sym notredame13 | I | 89.6 ±2.4 (77-93) | 91.0 ±0.2 (91-92) | 91.0 ±0.2 (90-92)
sym notredame13 | LO time (µs) | 0.0 ±0.0 (0-0) | 8090.3 ±1025.8 (7196-10973) | 707.8 ±131.0 (437-1009)
sym notredame13 | I (%) | 48.4 ±1.3 (42-50) | 49.2 ±0.1 (49-50) | 49.2 ±0.1 (49-50)
sym notredame13 | Samp | 110.6 ±53.7 (45-257) | 54.0 ±4.0 (45-67) | 46.1 ±14.8 (37-123)
sym notredame13 | Time (ms) | 4.2 | 11.7 | 5.9
sym notredame13 | Error | 1.81 ±0.62 (1.1-4.7) | 1.13 ±0.02 (1.1-1.2) | 1.15 ±0.09 (1.1-1.7)
sym notredame13 | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 2.9 ±1.2 (1-7)

Table 2: Results on six pairs that represent the whole dataset well, with the exception of the cases in Tab. 4. The entries are the number of inliers found (I), the inlier ratio I (%), the LO step time (LO time), the number of RANSAC samples (Samp), the CPU time (Time), the mean error on ground-truth correspondences (Error), and the number of local optimizations (LO count). The first value in each cell is the mean over 100 runs, the ± entries are standard deviations, and the minimum and maximum are given in parentheses. The blue plots represent the stability of each algorithm over 100 runs. The left one shows the probability of a tentative correspondence being an inlier (probability on the vertical axis, correspondence index on the horizontal axis); the correspondences were sorted so that the plot is non-increasing. In the ideal case, the plot looks like a rectangle; any other shape indicates that some of the tentative correspondences were not classified as inliers/outliers consistently over the 100 runs. The second plot is a histogram of the first plot.
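The stability plots described in the caption can be computed from the per-run inlier classifications in a few lines. This is an illustrative sketch; the histogram binning is an assumption.

```python
import numpy as np

def stability_curves(inlier_masks, bins=10):
    """inlier_masks: boolean array of shape (runs, correspondences), one row per
    RANSAC run. Returns the per-correspondence inlier probability sorted to be
    non-increasing (ideally a rectangle) and its histogram."""
    prob = np.asarray(inlier_masks, dtype=float).mean(axis=0)
    curve = np.sort(prob)[::-1]
    hist, edges = np.histogram(prob, bins=bins, range=(0.0, 1.0))
    return curve, hist, edges
```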
[Table 3 panels: for each of the image pairs 05, adam, boat, Brussels, graf and sym_notredame15, the first image, the second image, and a threshold-sensitivity plot of the mean GT error [px] against the inlier threshold [px] (range 5-20) are shown.]

Table 3: The dependence of the ground-truth error on the inlier threshold (RANSAC green, LO-RANSAC blue, L1-based RANSAC red). Note that the proposed L1 algorithm yields results very similar to LO-RANSAC. The ground-truth error was averaged over 10 runs for each of the methods. Experimental results are shown on the same image pairs as in Table 2.

The RANSAC parameters common to all three tested versions used in our experiments are summarized in Table 6. The inlier threshold θ is set, following [8], from σ in the following way:

$$\theta = 5.99\,(\sigma S)^2,$$

where $S = \max(w, h)/768$ is a scale factor dependent on the image dimensions. The factor 5.99 is the 95th percentile of the $\chi^2$ distribution with two degrees of freedom. Additional parameters used for the standard LO-RANSAC are summarized in Table 7. The proposed method does not introduce any new parameters.

confidence | 0.95
σ | 2.0
sample limit | 500000

Table 6: RANSAC parameters.

ILSQ iterations | 4
ILSQ sample limit | 28
threshold multiplier | 4
inner RANSAC repetitions | 10

Table 7: LO-RANSAC parameters.

Table 2 shows a sample of six image pairs representing well the results on the whole dataset, with the exception of a few cases described later. Note that the proposed L1 optimization is usually about 5 times faster than the standard LO step (see 'LO time'). Table 4 summarizes the performance on the few exceptional cases.

The error (see 'Error' in the tables) was computed by reprojecting hand-made ground-truth correspondences (about 8 of them for each image pair) by the model found by the respective algorithm.

Two observations summarize the results: i) the proposed procedure yields an error comparable to that of the standard LO-RANSAC, and ii) it usually runs approximately 5 times faster (see 'LO time' in the tables).

Table 3 shows the comparison of the dependence of the error on the inlier threshold for the standard RANSAC, the standard LO-RANSAC and the proposed method. The results are shown on the same subset of six image pairs, which is representative of the whole dataset. The experiment confirms that the proposed procedure is able to achieve results similar to the standard LO-RANSAC.

Image | Qty | RANSAC | LO-RANSAC | L1-based RANSAC
BruggeSquare | I | 201.0 ±12.9 (172-234) | 227.2 ±1.3 (224-232) | 214.9 ±11.4 (195-228)
BruggeSquare | LO time (µs) | 0.0 ±0.0 (0-0) | 12567.0 ±1944.9 (10166-16251) | 993.8 ±187.2 (598-1650)
BruggeSquare | I (%) | 60.0 ±3.9 (51-70) | 67.8 ±0.4 (67-69) | 64.1 ±3.4 (58-68)
BruggeSquare | Samp | 52.6 ±24.8 (15-153) | 42.5 ±10.9 (15-59) | 17.7 ±5.7 (9-41)
BruggeSquare | Time (ms) | 5.2 | 18.0 | 5.8
BruggeSquare | Error | 3.50 ±1.25 (1.2-6.2) | 2.44 ±0.12 (2.0-2.7) | 2.93 ±0.91 (1.3-4.6)
BruggeSquare | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 2.7 ±1.2 (1-5)
dlazky | I | 11.0 ±0.2 (9-11) | 11.0 ±0.0 (11-11) | 10.9 ±0.6 (7-11)
dlazky | LO time (µs) | 0.0 ±0.0 (0-0) | 1531.3 ±746.7 (603-4434) | 249.7 ±68.9 (109-419)
dlazky | I (%) | 14.8 ±0.3 (12-15) | 14.9 ±0.0 (15-15) | 14.7 ±0.8 (9-15)
dlazky | Samp | 10745.0 ±5429.3 (6963-27356) | 8392.7 ±3432.8 (6963-25008) | 8220.6 ±3301.0 (4820-19947)
dlazky | Time (ms) | 87.3 | 76.0 | 71.6
dlazky | Error | 2.99 ±0.95 (2.6-6.4) | 2.61 ±0.00 (2.6-2.6) | 5.43 ±20.63 (2.6-204.9)
dlazky | LO count | 0.0 ±0.0 (0-0) | 4.7 ±1.6 (1-9) | 7.0 ±3.1 (2-21)

Table 4: Results on two image pairs with unusual sensitivity to the inlier threshold. See the caption of Tab. 2 for a description of the entries.

The results for the two exceptional image pairs are shown in Table 5. The standard LO-RANSAC achieves good results (high stability, low error), while our proposed algorithm fails to stabilize the plain RANSAC results (the 'dlazky' pair is one of the most challenging ones in our dataset, as there are only 11 inliers).

[Table 5 panels: threshold-sensitivity plots of the mean GT error [px] against the inlier threshold [px] (range 5-20) for the BruggeSquare and dlazky pairs.]

Table 5: The dependence of the ground-truth error on the inlier threshold (RANSAC green, LO-RANSAC blue, L1-based RANSAC red) for two failure cases.
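For completeness, the two quantities used throughout this section, the inlier threshold θ = 5.99 (σS)² and the mean ground-truth error, can be sketched as follows. The reprojection step is one plausible reading of "reprojecting hand-made ground truth correspondences by the model found"; function names are illustrative.

```python
import numpy as np

def inlier_threshold(width, height, sigma=2.0):
    """theta = 5.99 * (sigma * S)^2 with S = max(w, h) / 768; 5.99 is the 95th
    percentile of the chi-squared distribution with two degrees of freedom."""
    S = max(width, height) / 768.0
    return 5.99 * (sigma * S) ** 2

def mean_gt_error(H, gt_src, gt_dst):
    """Mean distance between the hand-made ground-truth points in the second
    image (gt_dst, N x 2) and the corresponding points of the first image
    (gt_src, N x 2) mapped by the estimated 3x3 homography H."""
    pts = np.hstack([gt_src, np.ones((len(gt_src), 1))])   # homogeneous coordinates
    proj = pts @ H.T
    proj = proj[:, :2] / proj[:, 2:3]
    return float(np.linalg.norm(proj - gt_dst, axis=1).mean())
```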
4. Conclusions

We have shown that replacing the standard LO step of LO-RANSAC with the minimization of the sum of Huber kernel responses to the residuals has the following properties: it is simple and deterministic, it produces errors similar to those of the standard LO-RANSAC, and it is usually approximately 5 times faster. On the negative side, in the current implementation it has a higher probability of failure than the standard LO-RANSAC.

Acknowledgements

The research was supported by CTU student grant SGS15/155/OHK3/2T/13.

References

[1] A. Beck and S. Sabach. Weiszfeld's method: Old and new results. Journal of Optimization Theory and Applications, 164(1):1–40, 2014.
[2] O. Chum, J. Matas, and J. Kittler. Locally optimized RANSAC. In Pattern Recognition, pages 236–243. Springer, 2003.
[3] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
[4] H. Shao, T. Svoboda, and L. Van Gool. ZuBuD – Zürich buildings database for image based recognition. Technical report, 2003.
[5] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK, 2000. Chapter 3: Estimation – 2D Projective Transformations.
[6] D. C. Hauagge and N. Snavely. Image matching using local symmetry features. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 206–213. IEEE, 2012.
[7] P. Huber. Robust Statistics. Wiley Series in Probability and Statistics – Applied Probability and Statistics Section Series. Wiley, 2004.
[8] K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized RANSAC. In R. Bowden, J. Collomosse, and K. Mikolajczyk, editors, Proceedings of the British Machine Vision Conference, pages 1013–1023, London, UK, September 2012. BMVA.
[9] K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized RANSAC – Full experimental evaluation. Research Report CTU–CMP–2012–17, Center for Machine Perception, Czech Technical University, Prague, Czech Republic, September 2012.
[10] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J. Frahm. USAC: A universal framework for random sample consensus. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):2022–2038, Aug 2013.
[11] B. Tordoff and D. W. Murray. Guided sampling and consensus for motion estimation. In Computer Vision – ECCV 2002, pages 82–96. Springer, 2002.
[12] E. Weiszfeld and F. Plastria. On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research, 167(1):7–41, 2009.
[13] Z. Zhang. Parameter estimation techniques: A tutorial with application to conic fitting. Image and Vision Computing, 15(1):59–76, 1997.