Proceedings of the 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia February 3-5, 2016 Proceedings of the 21st Computer Vision Winter Workshop February 3-5, 2016, Rimske Toplice, Slovenia © Slovenian Pattern Recognition Society, Ljubljana, February 2016 Volume Editors: Luka Čehovin, Rok Mandeljc, Vitomir Štruc Publisher Slovenian Pattern Recognition Society, Ljubljana 2016 Electronic edition Slovenian Pattern Recognition Society, Ljubljana 2016 © SDRV 2016 CIP - Kataložni zapis o publikaciji Narodna univerzitetna knjižnica, Ljubljana 004.93(082)(086.034.4) 004.8(082)(086.034.4) COMPUTER Vision Winter Workshop (21 ; 2016 ; Rimske Toplice) Proceedings of the 21st Computer Vision Winter Workshop, Rimske Toplice, Slovenia, February 3-5, 2016 [Elektronski vir] / Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.). - Electronic ed. - Ljubljana : Slovenian Pattern Recognition Society, 2016 ISBN 978-961-90901-7-6 1. Čehovin, Luka 283229440 Maloprodajna cena: 19,99 € Message from the program chairs It is our pleasure and privilege to welcome you to the 21st Computer Vision Winter Workshop (CVWW2016). This year the workshop is organized by the Slovenian Pattern Recognition Society (SPRS), and held in Rimske Toplice, Slovenia, from of February 3rd to February 5th, 2016. We hope that your experience at CVWW is both professionally and personally rewarding! The Computer Vision Winter Workshop (CVWW) is an annual international meeting of several computer vision research groups, located in Ljubljana, Prague, Vienna, and Graz. The aim of the workshop is to foster interaction and exchange of ideas among researchers and PhD students. The focus of the workshop spans a wide variety of computer vision and pattern recognition topics, such as image analysis, medical imaging, 3D vision, human-computer interaction, vision for robotics, machine learning, as well as applied computer vision and pattern recognition. CVWW 2016 received a total of 23 submissions from six countries. The paper selection was coordinated by the Program Chairs, and included a rigorous double-blind review process. The international Technical Program Committee consisted of 39 renowned computer vision experts, who conducted the review. Each submission was examined by at least three experts, who were asked to comment on the strengths and weaknesses of the papers and justify their recommendation for accepting or rejecting a submission. The Program Chairs used the reviewers' comments to render the final decision on each paper. As a result of this review process, 8 papers were accepted for oral presentation, and 6 papers were accepted for presentation in the form of a poster. Authors of the accepted posters were also given the opportunity to present their work in the form of short one-minute talks at a designated spotlight session. 8 papers were accepted for presentation at the workshop in the form of invited presentations of on-going work (6 orals and 2 posters), and are not included in the proceedings to avoid conflicts with potential future submissions of the presented material. The Program Chairs would like to thank all reviewers for their high-quality and detailed comments, which served as a valuable source of feedback for all authors, and most of all for their time and effort, which helped to make the CVWW2016 a success. The workshop program included an invited talk by dr. 
Mario Fritz (Laboratory for Autonomous Intelligent Systems, Department of Computer Science, University of Freiburg), to whom we thank for his participation. We also extend our thanks to the Slovenian Pattern Recognition Society, through which the workshop was organized. CVWW 2016 benefits from its sponsors; and we want to acknowledge and thank our supporters from KOLEKTOR and the Faculty of Computer and Information Science for their contributions. To all the sponsors and their representatives in attendance, thank you! We hope that the 21st iteration of the Computer Vision Winter Workshop is a productive and enjoyable meeting for you and your colleagues, and inspires new ideas that can advance your professional activities. Welcome and thank you for your participation! Luka Čehovin, Rok Mandeljc, Vitomir Štruc CVWW2016 Program Chairs Ljubljana, Slovenia, January 2016 Committes PROGRAM CHAIRS Luka Čehovin (FRI University of Ljubljana) Rok Mandeljc (FRI, FE University of Ljubljana) Vitomir Štruc (FE University of Ljubljana) PROGRAM COMMITTEE Csaba Beleznai Stanislav Kovacic Rene Ranftl Horst Bischof Matej Kristan Daniel Prusa Jan Cech Walter Kropatsch Peter Roth Ondrej Chum Vincent Lepetit Robert Sablatnig Ondrej Drbohlav Jiri Matas Radim Sara Boris Flach Martin Matousek Walter Scheirer Vojtech Franc Mirko Navara Alexander Shekhovtsov Friedrich Fraundorfer Tomas Pajdla Danijel Skocaj Margrit Gelautz Peter Peer Tomas Svoboda Michal Havlena Roland Perko Peter Ursic Yll Haxhimusa Janez Pers Tomas Vojir Václav Hlaváč Roman Pfugfelder Andreas Wendel Ines Janusch Thomas Pock Paul Wohlhart Contents 1. Towards a Visual Turing Test: Answering Questions on Images (invited talk) [Abstract] Mario Fritz 2. A Longitudinal Diffeomorphic Atlas-Based Tissue Labeling Framework for Fetal Brains using Geodesic Regression [PDF] Roxane Licandro*, Georg Langs, Gregor Kasprian, Robert Sablatnig, Daniela Prayer, and Ernst Schwartz (Vienna University of Technology) 3. Quantitative Comparison of Feature Matchers Implemented in OpenCV3 [PDF] Zoltan Pusztai (Eörvös Loránd University) and Levente Hajder* (MTA SZTAKI) 4. Real-Time Eye Blink Detection using Facial Landmarks [PDF] Tereza Soukupova* and Jan Cech (Czech Technical University in Prague) 5. Solving Dense Image Matching in Real-Time using Discrete-Continuous Optimization [PDF] Alexander Shekhovtsov*, Christian Reinbacher, Gottfried Graber, and Thomas Pock (Graz University of Technology) 6. Touching without vision: terrain perception in sensory deprived environments [PDF] Vojtěch Šalanský*, Vladimír Kubelka, Karel Zimmermann, Michal Reinštein, and Tomas Svoboda (Czech Technical University in Prague) 7. Hessian Interest Points on GPU [PDF] Jaroslav Sloup, Jiri Matas, Michal Perdoch, Stepan Obdrzalek* (Czech Technical University in Prague) 8. BaCoN: Building a Classifier from only N Samples [PDF] Georg Waltner*, Michael Opitz, Horst Bischof (Graz University of Technology) 9. Cuneiform Detection in Vectorized Raster Images [PDF] Judith Massa, Bartosz Bogacz*, Susanne Krömker, Hubert Mara (University Heidelberg) 10. 2D tracking of Platynereis dumerilii worms during spawning [PDF] Daniel Pucher*, Walter Kropatsch, Nicole Artner, Stephanie Bannister (Vienna University of Technology) 11. Significance of Colors in Texture Datasets [PDF] Milan Šulc*, Jiri Matas, (Czech Technical University in Prague) 12. A Novel Concept for Smart Camera Image Stitching [PDF] Hanna Huber, Majid Banaeyan*, Raphael Barth, Walter Kropatsch (Vienna University of Technology) 13. 
A concept for shape representation with linked local coordinate systems [PDF] Manuela Kaindl*, Walter Kropatsch (Vienna University of Technology) 14. A Computer Vision System for Chess Game Tracking [PDF] Can Koray*, Emre Sumer (Başkent University) 15. Fast L1-based RANSAC for homography estimation [PDF] Jonáš Šerých*, Jiri Matas, Ondrej Drbohlav (Czech Technical University in Prague) Invited talk Towards a Visual Turing Test: Answering Questions on Images Mario Fritz Max Planck Institute for Informatics and Saarland University Abstract We address the task of automatically answering questions on images by bringing together latest advances from natural language processing and computer vision. In order to quantify progress on this challenging problem, we have established the first benchmark for this challenging problem that can be seen as a modern attempt at a visual turing test. Our first approach to this problem follows a more traditional AI approach, where we combine discrete reasoning with uncertain predictions by a multi-world approach that models uncertainty about the perceived world in a bayesian framework. More recently, we build on the success of deep learning techniques and propose an end-to-end formulation of this problem for which all parts are trained jointly. Looking forward, we see these two approach as two ends of a spectrum ranging from symbolic representations to vector-based embedding that we are currently exploring. Sponsors 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A Longitudinal Diffeomorphic Atlas-Based Tissue Labeling Framework for Fetal Brains using Geodesic Regression Roxane Licandro1,2 licandro@caa.tuwien.ac.at 1 Institute of Computer Aided Automation, Computer Vision Lab, Vienna University of Technology, http://www.caa.tuwien.ac.at/cvl 2 Department of Radiology and Image-guided Therapy, Computational Imaging Research Lab, Medical University of Vienna, http://www.cir.meduniwien.ac.at Georg Langs2 Gregor Kasprian2 Robert Sablatnig1 Daniela Prayer2 Ernst Schwartz2 Abstract. The human brain undergoes structural tal brain during pregnancy, a single map is not suf- changes in size and in morphology between the sec- ficient to model brain development [19]. Changes ond and the third trimester of pregnancy, corre- in size, according to accelerated growth, changes in sponding to accelerated growth and the progress of morphology, due to the progress of cortical folding cortical folding. To make fetal brains comparable, and deceleration of the proliferation of ventricular spatio-temporal atlases are used as a standard space progenitor cells [16] occur and are illustrated in Fig- for studying brain development, fetal pathology loca- ure 1a. Thus, a collection of brain maps is needed tions, fetal abnormalities or anatomy. The aim of this to describe these alterations as a function of time. work is to provide a continuous model of brain devel- For studying the brain organisation during its de- opment and to use it as base for an automatic tissue velopment, abnormalities, and locations of patholo- labeling framework. This paper provides a novel lon- gies, brain maps are used as a reference model [18]. gitudinal fetal brain atlas construction concept for Newly acquired brain images are labelled to iden- geodesic image regression using three different age- tify structures and possible abnormal changes or to ranges which are parametrized according to the de- find indicators for diseases. 
This labeling can be per- velopmental stage of the fetus. The dataset used for formed manually by annotating the images, which evaluation contains 45 T2−weighted Magnetic Res- needs an expert, time and consequently leads to in- onance (MR) volumes between Gestation Week (GW) creased costs compared to an automatic labeling pro- 18.0 and GW 30 day 2. The automatic tissue label- cedure [3]. In this case, labels for non annotated ing framework estimates cortical segmentations with images are estimated automatically by software us- a Dice Coefficient (DC) of up to 0.85 and ventricle ing a brain model for the mapping. Such an auto- segmentations with a DC of up to 0.60. mated labeling procedure on the one hand and a ref- erence model on the other form an atlas. To cover the time-dependent development of the fetal brain, time- 1. Introduction varying reference models are considered for building spatio-temporal atlases. The aim of brain mapping experiments is to cre- ate maps (models), based on studies, to understand 1.1. State-of-the-Art structural and functional brain organization. To this end, neuroimaging methods as well as knowledge of State-of-the-art approaches [8, 10, 13, 17, 21] for neuroanatomy and physiology are combined. Due to computing a spatio-temporal atlas combine registra- the fundamental changes occurring in the human fe- tion methods and interpolation techniques to obtain (a) Fetal Brain Development (b) Observable Brain Structures Figure 1: Left: MR imaging and schematic illustration of the fetal brain development at GW 20, 23 day 3 and 30 day 2. Right: Illustration of identifiable brain structures in a T2 weighted fast MR image acquired with a 1.5 Tesla scanner (Grey Matter (GM), White Matter (WM), the VENTricles (VENT) and the Germinal MATrix (GMAT) [21]). Also extraventricular Cerebro Spinal Fluid (CSF), Deep Grey Matter (DGM) and Non-Brain structures (NB), like skull or amniotic fluid are identifiable. MR images courtesy of Medical University of Vienna (MUW). continuity in time. The use of an ”all-to-one” ap- imaging technique is used as an alternative to ultra- proach (a single subject as reference) introduces sub- sonography for prenatal diagnosis and is able to im- stantial bias. The brain structures of fetuses can- age a fetus in a non-invasive way. Distinguishable not be described by one image, since it does not re- brain structures using this technique are illustrated in flect occurring changes over time [10, 17]. Exclu- Figure 1b. A problem of MR imaging is the lack of sive pairwise affine registration for image alignment comparability and constancy of gray-values. Thus, results in blurred regions in the templates obtained for the comparison of brains of adult patients, an at- by intensity averaging. Affine registration is not ca- las as a standard space is required, which avoids the pable of compensating local inter-subject variabil- gray-value discrepancies. The brains are mapped to a ity [17]. This leads to worse registration results be- standardized coordinate system according to marked tween atlas-based segmentations and individual ob- anatomical locations. However, the fetal brain is a jects compared to non-rigid approaches, which show developing structure. In comparison to building an a higher level of detail [17]. An advantage of pair- atlas of an adult brain, the fast change of a fetal brain wise approaches lies in the registration of wider age- in shape and size has to be taken into account [10]. 
ranges between 15 to 18 Gestation Weeks (GW), Also, fetal brains at a certain GW show differences in compared to groupwise approaches, which are able orientation shape and size. Possible reasons are the to cover only small age ranges between 5 to 8 GW. inaccuracy in determination of the gestational age, A benefit of groupwise registration approaches is inter-patient variability or pathological growth pro- the template-free estimation of the initial reference cesses [15]. The motivation for building a fetal atlas space. The template is estimated and updated dur- is the possibility to compare fetal brains for study- ing the registration procedure [10]. The main limita- ing brain development, fetal pathology locations, fe- tions of groupwise registration lie in the lower level tal abnormalities or anatomy. of anatomic definition [17]. Examples for pairwise approaches can be found in [10, 17] and for group- 1.3. Contribution wise approaches in [8, 13, 21]. We create a tissue labeling framework for corti- cal and ventricle structures in the fetal brain from 1.2. Challenges GW 18 to GW 30. An automatic segmentation pro- Imaging of a fetus in utero is challenging, due to cedure including a longitudinal fetal brain atlas and its constantly changing position, which causes image a labeling procedure are considered. In our work unsharpness and artefacts [5]. Thus, a main issue in we demonstrate that image regression is capable to fetal imaging lies in shortening the image acquisi- build a spatio-temporal atlas of the fetal brain and tion time to 20 seconds and to use motion correc- is able to model a mean trajectory encoding the tion techniques [4]. The Magnetic Resonance (MR) brain development in a single diffeomorphic defor- mation, instead of calculating discrete age-dependent tion = 0.78 - 0.9 pixels per mm, Slice thickness = templates combined with interpolation. As found 3 - 4.4mm, Acquisition matrix = 256×256, Field in literature [7, 9, 11], image regression for time- of view = 200 - 230mm, Specific Absorption Rate series data have been evaluated only using adult- and (SAR) = < 100% /4.0W/kg, Image acquisition time child-brain datasets, which record changes of brain = ≤ 20s, TE (Echo Time) = 100 - 140ms, TR (Rep- structure over time. In the proposed work the lo- etition Time) = 9000 - 19000ms. The dataset of MR cal inter-subject variability is considered to be mod- images used for atlas learning are preprocessed using elled continuously in time and non-rigidly in space the pipeline illustrated in Figure 2. First the images by geodesic regression [1, 2]. The computed atlas are motion corrected using the toolkit for fetal brain is used as a prior of the Graph Cut (GC) approach MR images published by Rousseau et al. [14]. Sub- for multi label segmentation proposed by Yuan et sequently, the manual annotation of the cortex, left al. [20]. and right eye, ventricle and occipital foramen mag- The paper is organized as follows. In Section 2 an num is performed by an expert. After this step, the overview of the methodology used and the concept data is rigidly aligned, the surrounding mother tissue of the tissue labeling framework proposed is pre- is excluded in a masking step and the volumes are sented. The results and the corresponding discussion cropped to reduce computational costs in the longi- are given in Section 3. This work concludes with a tudinal registration procedure using a bounding box summary of the contributions in Section 4. of size 90 × 140 × 140 voxels. 2.2. 
Spatio Temporal Atlas Learning 2. Methodology The algorithm used for Diffeomorphic Anatom- The framework proposed is illustrated in Figure 2. ical RegistraTion using Exponential Lie algebra The input represents a gray value image I_new at time (DARTEL) of Ashburner et al. [1, 2] for geodesic point t_new, which is preprocessed in a first step, by regression is integrated in the Statistical ParaMetric performing motion correction, rigid alignment, im- (SPM) tool box - release SPM8 1. This approach age masking and image cropping. Subsequently, the is used to encode the brain development in a single longitudinal diffeomorphic fetal brain atlas is used to diffeomorphic deformation by optimising the energy estimate a time point t_new corresponding diffeomor- term E expressed in Equation 1 [2]. phic transformation for computing a time-dependent intensity image I_A and a time-dependent segmenta-
E = \frac{1}{2}\|L v_0\|^2 + \frac{1}{2}\sum_{n=1}^{N}\int_{x\in\Omega}\|I_{t_0} - I_{t_n}(\varphi_{t_n})\|^2\,dx \quad (1)
tion for ventricular and cortical tissue S^tissue_A in atlas space. In a pairwise registration procedure, a trans- formation T from the preprocessed input (Aligned The term ϕ_{t_n} denotes the forward deformation from I_new) to the atlas-based intensity image I_A is esti- source I_{t_0} to target I_{t_n} at time point t_n, mated. The inverse of the computed transformation where n = 1, . . . , N and L represents a model of the ”inertia” T^{-1} is used to transform the atlas based segmenta- of the system, i.e. a linear operator which operates tions S^tissue_A to the subject’s space (S^tissue_A ◦ T^{-1} = on a time-dependent velocity that mediates the defor- S^tissue_GC). As next step the transformed segmentations mation over unit time [2]. It is introduced to derive S^tissue_GC and I^GC_new are used as input parameters for the an initial momentum m_0 multi label GC segmentation refinement. The output through an initial velocity v_0. of the framework are segmentations for ventricular The velocity field v(x) learned at position x is parametrised using a linear combination of i basis and cortical brain tissues S^tissue_new of the input image functions. Such basis functions consist of a vector I_new. of coefficients c_i and an i-th first degree B-spline basis 2.1. Image Acquisition and Preprocessing function ρ_i(x) (cf. Equation 2) [1]. The time series MR image dataset used consists
v(x) = \sum_i c_i\,\rho_i(x) \quad (2)
of 45 healthy fetal brains with an age range between 18 and 30 GW. The MR image acquisition is per- formed using a 1.5 Tesla Philips Gyroscan superconduct- The aim of the DARTEL implementation is to esti- ing unit scanner performing a single-shot, fast spin- mate an optimized parametrisation of c. The energy 1http://www.fil.ion.ucl.ac.uk/spm/; echo T2-weighted MR sequence: In-plane resolu- [accessed 07 December 2015] Figure 2: Fetal brain tissue labeling framework. MR images courtesy of MUW. cost term E in Equation 1 is reformulated in terms term µ encodes the variance according to symmetric of finding the coefficients of c for a given dataset D components, rotations and the penalisation of scaling with maximum probability (cf. Equation 3). A maxi- and shearing. The likelihood term encodes the prob- mization of the probability leads to the minimization ability of c given the data D [1] and corresponds to of its negative logarithm and thus, is used to interpret the mean-squared difference between a warped tem- registration of data D as a minimization procedure plate deformed by the calculated transformation and of the objective function (− log p(c, D)) expressed the target image.
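To make Equations 1 and 2 concrete, the following rough NumPy sketch evaluates the two quantities. It is an illustration only, not the SPM8/DARTEL code; the operator L and the warping routine are passed in as placeholders:

```python
import numpy as np

def velocity_field(coeffs, basis_functions, x):
    """Equation 2: v(x) = sum_i c_i * rho_i(x), with first-degree B-spline bases."""
    return sum(c * rho(x) for c, rho in zip(coeffs, basis_functions))

def regression_energy(v0, L_operator, source, targets, warp_with_velocity):
    """Equation 1 (sketch): E = 1/2 ||L v0||^2 + 1/2 sum_n ||I_t0 - I_tn(phi_tn)||^2.

    v0                 : initial velocity field (NumPy array).
    L_operator         : callable applying the linear 'inertia' operator L.
    source             : template image I_t0.
    targets            : list of (t_n, image) pairs for the observed time points.
    warp_with_velocity : hypothetical callable that applies the forward
                         deformation phi_tn (derived from v0) to a target image.
    """
    regularisation = 0.5 * np.sum(L_operator(v0) ** 2)
    data_term = 0.0
    for t_n, target in targets:
        warped_target = warp_with_velocity(target, v0, t_n)  # I_tn(phi_tn)
        data_term += 0.5 * np.sum((source - warped_target) ** 2)
    return regularisation + data_term
```

Minimising E over the B-spline coefficients c (Equation 3 below recasts this as a probabilistic objective) yields the single deformation that encodes the whole developmental trajectory.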
in Equation 3, consisting of a prior term (− log p(c)) and a likelihood term (− log p(D|c)) [1]. 2.2.1 Optimisation Procedure − log p(c, D) = − log p(c) − log p(D|c) (3) A Full Multi Grid (FMG) approach is used to solve the equation (cf. Equation 4) which is needed to up- The prior term denotes the prior probability p(c). date the vector field during a Gauß-Newton opti- Ashburner et al. [1] use a concentration matrix (in- mising procedure, where Hiter denotes the Hessian, verse of a covariance matrix) K to encode spa- giter the gradient and K the concentration matrix. tial variability. The parameters [λ Details regarding the computation of viter+1 are ex- 1, λ2, λ0, λ, µ], 0 plained in [1, 2]. which have to be predefined to compute K, influence the behaviour of the deformation (bending energy, viter+1 = viter − (K + Hiter)−1 (Kviter + giter) (4) 0 0 0 stretching, shearing) as well as the divergence and amount of volumetric expansion or contraction [1]. For this task images are observed in different scales. The term λ0 encodes the penalisation of absolute dis- For every resolution level multigrid methods recur- placements, λ1 penalises the difference between two sively estimate the field, starting at the coarsest scale neighboured vectors by observing the first derivatives and computing the residual to solve the update equa- (linear term) of the displacements, λ2 penalises the tions on the current grid. Subsequently, the solution difference between the first derivatives of two neigh- is prolongated to the next finer grid [1]. boured vectors by observing the second derivatives 2.3. Automatic Tissue Labeling using Graph Cuts of the displacements and λ denotes the variability of the spatial locations (divergence of each point in the For tissue labeling, we use a continuous max flow flow field) with a constant value. Increasing λ leads formulation of a multi label GC [20]. Three input to increasing smoothing of the flow vector field and parameters are necessary for performing tissue seg- preserves volumes during the transformation. The mentation. A data term (gray value volume Inew at age tnew), a cost (unary) term, and a penalty day 3 (164 GD) to 26 GW day 2 (184 GD) and age (binary) term. For computing a unary term, atlas range 3 from 26 GW day 2 (184 GD) to 30 GW day 2 based segmentations for cortex and ventricle tissue (212 GD). The first part of the evaluation documents Stissue = {Scortex, Sventricle} at age t are estimated the atlas learning results for each age range. Subse- and smoothed with a Gaussian filter G. The parame- quently, the atlases computed are used to evaluate the ter δ is defined to weight the smoothed result with a tissue labeling procedure as a second part of the eval- constant factor. The unary term is illustrated in Equa- uation. Estimated atlas templates at the testing time- tion 5, where ? denotes the convolution operator. point are pairwise registered to the test MR volume to obtain a transformation T . The inverse T −1 is used C = δ ∗ (Stissue ? G) (5) to transform the atlas based segmentation to the test- subject’s space. As last step the segmentation of the Three different binary terms are evaluated: test volume using the transformed atlas is computed. Penalty term 1 (P1) is a weighted norm of the gra- To evaluate our approach, we report the overlap be- dient of the data term D (cf. Equation 6), where δ tween automatic- and manual segmentations of the denotes the same weighting term as used in Equation fetal cortex and ventricles. 
In the leave-one-out cross 5 and a, b are constant weighting parameters. validation, we compare the Dice Coefficient (DC) [6] between the groundtruth annotation and different au-
P_1 = \delta\,\frac{b}{1 + a\,\|\nabla D\|} \quad (6)
tomatic segmentations based on (1) the atlas, (2) the transformed atlas, and (3) the GC segmentation opti- Penalty term 2 (P2) denotes an intensity based term mization. and is calculated separately for cortex and ventri- Furthermore, we report the volume of cortex and cle segmentation (cf. Equation 7). Tissue type spe- ventricles, and the area of the cortical surface of the cific gray values are modelled as Gaussian distribu- atlas based segmentations. tions N(µ_tissue, σ_tissue), whose parameters µ_tissue and σ_tissue are estimated using the a-priori atlas seg- 3.1. Results Spatio-Temporal Atlas Learning mentation. These parameters are used to calculate the probability of every pixel belonging to cortex or The deformation behaviour of image regres- ventricle. Subsequently, the gradient of the resulting sion using 21 different regularisation kernels probability map P and its norm are computed and K [λ_1, λ_2, λ_0, λ, µ] (cf. Section 2.2) is evaluated for weighted by the parameters δ, a, b as shown in Equa- every age range. Besides the DC also the behaviour tion 6. of the regularisation of the volume expansion and changes of the area of cortical surface have to be
P_2 = \delta\,\frac{b}{1 + a\,\|\nabla P(\mu_{tissue}, \sigma_{tissue})\|} \quad (7)
taken into account, when choosing a suitable ker- nel. Atlas-based cortical and ventricle segmentations Penalty term 3 (P3) represents an exponential for- are studied. According to the evaluation results, ker- mulation and is expressed in Equation 8. The param- nel 1 (K1 0.01, 0.01, 9e−6, 1e−5, 1e−5) is chosen eter u is a constant and v a linear weighting parame- as suitable regularisation for age range 1, kernel 4 ter. The term w weights the norm of the image’s D (K4 0.01, 9e−6, 9e−6, 0.01, 1e−5) for age range 2 gradient non-linearly in the exponential term. and kernel 7 (K7 0.01, 0.01, 9e−6, 0.01, 1e−5) for age range 3. Figure 3a shows examples of the at- las templates learned and Figure 3b illustrates the
P_3 = u + v\,\exp\!\left(-\frac{\|\nabla D\|}{w}\right) \quad (8)
anatomical details of these at age GW 21 day 4 (GD 151), GW 24 day 3 (GD 171) and GW 29 (GD 203). 3. Results In both figures the growth of the brain structures is Evaluation of the proposed framework is per- observable. The brain model at age range 1 is char- formed using leave-one-out cross validation. In this acterised by a smoother cortex surface in compari- paper a novel longitudinal registration procedure is son to a brain at a higher age range. It also visu- formulated by dividing the data set into three age alises the increase of the cortical folding grade. Ac- ranges, based on the developmental stage of the fetus. cording to Pugash et al. [12], the ventricles achieve Age range 1 reaches from 20 GW day 6 (146 GD) to their thickest size in early gestation and regress in the 23 GW day 3 (164 GD), age range 2 from 23 GW third trimester, which is not visible. The regularisa-
[Figure 3 panels: atlas based templates for age range 1 (Kernel 1, GD 148–163), age range 2 (Kernel 4, GD 164–181) and age range 3 (Kernel 7, GD 184–212); (a) Atlas based templates, (b) Details of atlas based templates.]
Figure 3: Left: Atlas based templates of age range 1, 2 and 3 between GW 21 day 1 (GD 148) and GW 30 day 2 (GD 212).
Right: Anatomical details of atlas based templates at age GW 21 day 4 (GD 151), GW 24 day 3 (GD 171) and GW 29 (GD 203). Coronal (first row), axial (second row) and sagital (third row) slices are illustrated. Denoted structures: Sylvian Fissure (SF), InterHemispheric Fissure (IHF), Germinal MATrix (GMAT), Lateral-VENTricle (L-VENT), Cingulate Sulcus (CiS), ColLateral Sulcus (CLS), Cavum of Septum Pellucidum (CSP), Occipital Lobe (OL), Frontal Lobe (FL), Central Sulcus (CeS), PreCentral Gyrus (PreCG), PostCentral Gyrus (PostCG), ParietoOccipital Sulcus (POS) and Calcarine Sulcus (CaS). tion term for geodesic regression is not able to model (CaS) and PreOccipital Sulcus (POS). location specific volume expansion and shrinkage at 3.2. Results Automatic Tissue Labeling the same time. This leads to worse modelling results for ventricles, compared to cortical structure, since a For pairwise registration kernel A kernel is chosen which models expansion. Addition- (K A 5e−3, 5e−3, 3e−5, 1e−5, 9e−6) is used ally, the subject specific variability of age-dependent for regularisation. The DC distributions of seg- ventricle size in the dataset and the complex form of mentations of the cortex for age range 1, 2 and ventricles complicate the determination of a suitable 3 are illustrated in Figure 4 on the top and for kernel and consequently the registration procedure. ventricle segmentations on the bottom. The DC Observable structures at every age range are Sylvian distribution of atlas based and transformed atlas- Fissures (SF), Lateral VENTricle (L-VENT), Inter- based segmentations using pairwise registration are Hemispheric Fissure (IHF), Cavum of Septum Pellu- illustrated and the three dotted lines visualise the cidum (CSP), Occipital Lobe (OL) and Frontal Lobe DCs of GC based segmentations computed using (FL). The SF show in the coronal and axial slices a penalty terms 1, 2 and 3. For age range 1 the smooth bending at age range 1 and develop to a deep highest DC improvement from 0.727 to 0.771 at fold at the lateral side of the brain at age range 3. GD 158 is achieved by pairwise registration and GC Also the IHF shows a deeper folding at age range refinement compared to atlas based segmentations. 3 with Cingulate Sulcus (CiS) as additional form- In contrast to this no improvement is reached at GD ing compared to age range 1. The Germinal MATrix 151, but shows the highest DC of about 0.851. At (GMAT) is existent until age range 2 and disappears GDs older than 154 the GC refining using penalty later in the third trimester of pregnancy. The Central 1 and penalty 2 achieve a higher DC increase of Sulcus (CeS) formation starts at age range 2 and gets about 0.02 compared to using penalty 3. At age more apparent at age range 3 as well as the develop- range 2 no improvement of transformed atlas based ing of the PreCentral Gyrus (PreCG) and PostCen- segmentations is observed after pairwise registration, tral Gyrus (PostCG). The ColLateral Sulcus (CLS) is which leads to a decrease of the DC. 
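For readers who want to experiment with the graph-cut terms used in these results, Equations 5–8 translate almost directly into NumPy/SciPy. The sketch below is a plain re-statement of those formulas (δ, a, b, u, v, w are the weighting constants of Section 2.3, chosen by the user), not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_magnitude(volume):
    # Norm of the spatial gradient, used by all penalty terms.
    grads = np.gradient(volume.astype(np.float64))
    return np.sqrt(sum(g ** 2 for g in grads))

def unary_term(atlas_segmentation, delta, sigma=1.0):
    # Equation 5: C = delta * (S_tissue convolved with a Gaussian G).
    return delta * gaussian_filter(atlas_segmentation.astype(np.float64), sigma)

def penalty_1(intensities, delta, a, b):
    # Equation 6: penalty driven by the gradient of the data term D.
    return delta * b / (1.0 + a * gradient_magnitude(intensities))

def penalty_2(intensities, atlas_segmentation, delta, a, b):
    # Equation 7: same form, but on a tissue probability map built from a
    # Gaussian intensity model whose mean/std are estimated from the atlas prior
    # (assumes the prior marks at least a few voxels).
    fg = intensities[atlas_segmentation > 0]
    mu, sigma = fg.mean(), fg.std() + 1e-6
    prob = np.exp(-0.5 * ((intensities - mu) / sigma) ** 2)
    return delta * b / (1.0 + a * gradient_magnitude(prob))

def penalty_3(intensities, u, v, w):
    # Equation 8: exponential weighting of the image gradient norm.
    return u + v * np.exp(-gradient_magnitude(intensities) / w)
```

These maps would then be handed to the continuous max-flow solver of Yuan et al. [20] as cost and penalty terms.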
It is observed that the labeling result of the pairwise registration
[Figure 4 plots: DC (cortex), top, and DC (ventricle), bottom, over gestational days 150–210 for the ATLAS, PW and GC (P1, P2, P3) segmentations, grouped into age ranges 1–3.]
Figure 4: DCs of automatically estimated labels of the cortex and ventricle at age range 1, 2 and 3.
[Figure 5 panels: DATA, ATLAS, PW, GC and M at GD 171 and GD 203.]
Figure 5: Top: Coronal view - segmentations of the cortex at GD 171 (GW 24 day 3), bottom: sagittal view - segmentations of the ventricle at GD 203 (GW 29). Segmentations are illustrated as estimated by the atlas (ATLAS), after the pairwise registration procedure (PW), estimated by the GC approach (GC) and manual annotations (M).
has an influence on the GC labeling since it acts as is not capable of compensating differences in volume initialization of this procedure, best visible at GD size or absolute displacements. If an estimated 184. The GC refinement is able to compensate the segmentation has a bigger volume than the structure results of the pairwise registration between GD 164 to be segmented or is displaced, then the borders of and 184 and shows an increase of the DC between neighboured tissue prevent the GC approach from atlas and graph-cut based segmentations on average cutting through regions of a high gradient, since by about 0.02. At age range 3 an increase of DC at this would lead to increasing costs in the energy every age range is achievable using GC refinement. minimisation procedure. Consequently, the GC is The highest improvement between atlas-based seg- not capable of refining the segmentation. In Figure 5 mentations and GC based segmentations is reached an example of a misaligned segmentation and its at GD 206 with a DC increase from 0.71 to 0.795. deformation through the labeling procedure is illus- The highest DC at age range 3 of about 0.819 is trated. The displacement is observable at the IHF in achieved at GD 203 and the lowest of about 0.575 at the first column and the superior part of the anterior GD 184. It is observable that pairwise registration horn of the ventricle in the second column. Test data and corresponding estimated segmentations, labeling. Finally, the proposed framework is able to transformed segmentations to subject’s space and estimate cortex segmentations with a DC up to 0.85 GC based segmentations of the cortex at GD 171 and ventricle segmentations up to 0.60. We show (top) and of ventricular tissue at GD 203 (bottom) that image regression is capable of modelling the vari- are shown. The GC segmentations are computed ability of fetal brains in time and is qualified to be using the penalty term 3, since it shows the best used for building a spatio-temporal atlas as basis for improvement between atlas-based and GC based fetal brain tissue segmentation. The evaluation of the segmentations. cortical labeling results for age range 1, 2 and 3 show that a single kernel for pairwise registration for every 4. Conclusion age range is not suitable.
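The evaluation measures used in this section — Dice overlap between automatic and manual labels, tissue volume, and cortical surface area — can be computed along the following lines. This is a sketch only (it assumes a recent scikit-image, and the voxel spacing values are placeholders; the real spacing comes from the MR acquisition):

```python
import numpy as np
from skimage import measure

def dice_coefficient(automatic, manual):
    # DC = 2 |A ∩ M| / (|A| + |M|), cf. Dice [6].
    a, m = automatic.astype(bool), manual.astype(bool)
    denom = a.sum() + m.sum()
    return 2.0 * np.logical_and(a, m).sum() / denom if denom else 1.0

def tissue_volume_mm3(segmentation, spacing=(3.0, 0.9, 0.9)):
    # Voxel count times voxel volume; spacing is (slice, row, col) in mm.
    return segmentation.astype(bool).sum() * np.prod(spacing)

def cortical_surface_area_mm2(segmentation, spacing=(3.0, 0.9, 0.9)):
    # Surface area of the marching-cubes mesh of the binary segmentation.
    verts, faces, _, _ = measure.marching_cubes(
        segmentation.astype(np.float64), level=0.5, spacing=spacing)
    return measure.mesh_surface_area(verts, faces)
```

The DC definition matches Dice [6]: twice the overlap divided by the sum of the two segmentation sizes.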
Thus, a main focus of future work will lie in the improvement of the labeling pro- In this paper an automatic fetal brain tissue label- cedure, by evaluating age range and tissue dependent ing framework using geodesic image regression was regularisation, to improve the quality of graph cut presented and was identified to be suitable as regis- based segmentation. Additionally, a combination of tration approach to longitudinally model the changes global rigid and local deformable pairwise registra- of the brain during the 18th and 30th GW. The advan- tion could be analysed for transforming atlas based tage is the provision of a time-dependent transforma- segmentations to the subject’s space as extension to tion from a source to a target brain, instead of com- this work. bining a template building technique and interpola- tion technique to obtain continuity in time. A novel Acknowledgements longitudinal registration scheme was proposed, using This work was co-funded by ZIT - Life Sciences separate age ranges for flexible regularisation of the 2014, grant number 1207843, Project Flowcluster, deformation behaviour due to the age range depen- and by OeNB (15929). dent changes. The atlas learned was evaluated us- ing a leave-one-out cross validation approach for ev- References ery age range and 21 different regularisation kernels were analysed according to their behaviour regard- [1] J. Ashburner. A fast diffeomorphic image regis- ing volume expansion, modelling of cortical surface tration algorithm. NeuroImage, 38(1):95–113, and Dice similarity to manual annotations. The fe- Oct. 2007. 3, 4 tal brain atlas proposed is not capable of modelling [2] J. Ashburner and K. Friston. Diffeomorphic the thinning of ventricles from age range 1 to age registration using geodesic shooting and Gauss- range 3. Since the proposed method uses one regu- Newton optimisation. NeuroImage, 55(3):954– larisation kernel per age range, geodesic regression 967, Apr. 2011. 3, 4 is not able to regularise location specific volume ex- [3] M. Becker and N. Magnenat-Thalmann. De- pansion and shrinkage at the same time. To overcome formable models in medical image segmenta- this issue, the usage of tissue specific regularisation tion. In N. Magnenat-Thalmann, O. Ratib, and and consequently the computation of separate ven- H. Choi, editors, 3D Multiscale Physiological tricle atlases are a possible solution. In contrast to Human, pages 81–106. Springer London, Jan. this, the increase of the cortical folding grade and of 2014. 1 the volume over time are integrated in the proposed spatio-temporal model. The quality of transformed [4] L. Breysem, H. Bosmans, S. Dymarkowski, atlas based segmentations to subject’s space using D. V. Schoubroeck, I. Witters, J. Deprest, pairwise registration leads to the conclusion that the P. Demaerel, D. Vanbeckevoort, C. Vanhole, kernel for pairwise registration has to be defined dif- P. Casaer, and M. Smet. The value of fast MR ferently according to the age range and also tissue imaging as an adjunct to ultrasound in prenatal type, for being able to improve the graph cut initiali- diagnosis. European Radiology, 13(7):1538– sation term. Additionally, it is shown that the quality 1548, July 2003. 2 of graph cut labeling is dependent on the initialisa- [5] M. Clemence. How to shorten MRI sequences. tion cost term (atlas segmentation) and the penalty In D. Prayer, editor, Fetal MRI, Medical Radiol- term. A false or displaced atlas segmentation hinders ogy, pages 19–32. 
Springer Berlin Heidelberg, as cost term the refinement of the graph cut based 2011. 2 [6] L. Dice. Measures of the amount of ecologic as- Toolkit for Fetal Brain MR Image Process- sociation between species. Ecology, 26(3):297– ing. Computer methods and programs in 302, July 1945. 5 biomedicine, 109(1):65–73, Jan. 2013. 3 [7] S. Durrleman, X. Pennec, A. Trouvé, J. Braga, [15] T. Saul, R. Lewiss, and M. Rivera. Accuracy G. Gerig, and N. Ayache. Toward a comprehen- of emergency physician performed bedside ul- sive framework for the spatiotemporal statisti- trasound in determining gestational age in first cal analysis of longitudinal shape data. Interna- trimester pregnancy. Critical Ultrasound Jour- tional Journal of Computer Vision, 103(1):22– nal, 4(1):1–5, Dec. 2012. 2 59, May 2013. 3 [16] J. Scott, P. Habas, K. Kim, V. Rajagopalan, [8] P. Habas, K. Kim, J. Corbett-Detig, K. Hamzelou, J. Corbett-Detig, A. Barkovich, F. Rousseau, O. Glenn, A. Barkovich, and O. Glenn, and C. Studholme. Growth trajecto- C. Studholme. A spatiotemporal atlas of MR ries of the human fetal brain tissues estimated intensity, tissue probability and shape of the from 3D reconstructed in utero MRI. Interna- fetal brain with application to segmentation. tional Journal of Developmental Neuroscience, NeuroImage, 53(2):460–470, Nov. 2010. 1, 2 29(5):529–536, Aug. 2011. 1 [9] Y. Hong, Y. Shi, M. Styner, M. Sanchez, and [17] A. Serag, P. Aljabar, G. Ball, S. Counsell, M. Niethammer. Simple geodesic regression J. Boardman, M. Rutherford, A. Edwards, for image time-series. In B. Dawant, G. Chris- J. Hajnal, and D. Rueckert. Construction of tensen, J. Fitzpatrick, and D. Rueckert, editors, a consistent high-definition spatio-temporal at- Biomedical Image Registration, number 7359 las of the developing brain using adaptive ker- in Lecture Notes in Computer Science, pages nel regression. NeuroImage, 59(3):2255–2265, 11–20. Springer Berlin Heidelberg, Jan. 2012. Feb. 2012. 1, 2 3 [18] C. Studholme. Mapping fetal brain develop- [10] M. Kuklisova-Murgasova, P. Aljabar, L. Srini- ment in utero using magnetic resonance imag- vasan, S. Counsell, V. Doria, A. Serag, I. Gou- ing: the big bang of brain mapping. Annual sias, J. Boardman, M. Rutherford, A. Edwards, review of biomedical engineering, 13:345–368, J. Hajnal, and D. Rueckert. A dynamic 4D Aug. 2011. 1 probabilistic atlas of the developing brain. Neu- [19] A. Toga and P. Thompson. 1 - an introduction roImage, 54(4):2750–2763, Feb. 2011. 1, 2 to maps and atlases of the brain. In A.W. Toga [11] M. Niethammer, Y. Huang, and F. Vialard. and J.C. Mazziotta, editors, Brain Mapping: Geodesic regression for image time-series. In- The Systems, pages 3–32. Academic Press, San ternational Conference MICCAI 2011, 14(Pt Diego, 2000. 1 2):655–662, 2011. 3 [20] J. Yuan, E. Bae, X. Tai, and Y. Boykov. A [12] D. Pugash, U. Nemec, P. Brugger, and continuous max-flow approach to potts model. D. Prayer. Fetal MRI of Normal Brain Devel- In K. Daniilidis, P. Maragos, and N. Paragios, opment. In D. Prayer, editor, Fetal MRI, Medi- editors, Computer Vision ECCV 2010, num- cal Radiology, pages 147–175. Springer Berlin ber 6316 in Lecture Notes in Computer Sci- Heidelberg, Jan. 2011. 5 ence, pages 379–392. Springer Berlin Heidel- [13] L. Risser, F. Vialard, A. Serag, P. Ajabar, and berg, Jan. 2010. 3, 4 D. Rueckert. Construction of diffeomorphic [21] J. Zhan, I. Dinov, J. Li, Z. Zhang, S. Hobel, spatio-temporal atlases using Krcher means and Y. Shi, X. Lin, A. Zamanyan, L. Feng, G. 
Teng, LDDMM: Application to early cortical devel- F. Fang, Y. Tang, F. Zang, A. Toga, and S. Liu. opment. In Workshop on Image Analysis of Hu- Spatialtemporal atlas of human fetal brain de- man Brain Development (IAHBD), in Interna- velopment during the early second trimester. tional Conference MICCAI 2011, Sept. 2011. NeuroImage, 82:115–126, Nov. 2013. 1, 2 1, 2 [14] F. Rousseau, E. Oubel, J. Pontabry, M. Schweitzer, C. Studholme, M. Koob, and J. Dietemann. BTK: An Open-Source 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 Quantitative Comparison of Feature Matchers Implemented in OpenCV3 Zoltan Pusztai Levente Hajder Eötvös Loránd University MTA SZTAKI Budapest, Hungary Kende u. 13-17. Budapest, Hungary-1111 puzsaai@inf.elte.hu http://web.eee.sztaki.hu Abstract. The latest V3.0 version of the popular The description of the optical flow datasets of Open Computer Vision (OpenCV) framework has just Middlebury database was published in [3]. It was been released in the middle of 2015. The aim of this developed in order to make the optical flow methods paper is to compare the feature trackers implemented comparable. The latest version contains four kinds in the framework. OpenCV contains both feature de- of video sequences: tector, descriptor and matcher algorithms, all possi- ble combinations of those are tried. For the compar- 1. Fluorescent images: Nonrigid motion is taken ison, a structured-light scanner with a turntable was by both color and UV-camera. Dense ground used in order to generate very accurate ground truth truth flow is obtained using hidden fluorescent (GT) tracking data. The tested algorithm on track- texture painted on the scene. The scenes are ing data of four rotating objects are compared. The moved slowly, at each point capturing separate results is quantitatively evaluated as the matched co- test images in visible light, and ground truth im- ordinates can be compared to the GT values. ages with trackable texture in UV light. 2. Synthesized database: Realistic images are gen- 1. INTRODUCTION erated by an image syntheses method. The tracked data can be computed by this system Developing a realistic 3D approach for feature as every parameters of the cameras and the 3D tracker evaluation is very challenging since realisti- scene are known. cally moving 3D objects can simultaneously rotate and translate, moreover, occlusion can also appear in 3. Imagery for Frame Interpolation. GT data is the images. It is not easy to implement a system that computed by interpolating the frames. There- can generate ground truth (GT) data for real-world fore the data is computed by a prediction from 3D objects. The Middlebury database1 is consid- the measured frames. ered as the state-of-the-art GT feature point gener- 4. Stereo Images of Rigid Scenes. Structured light ator. The database itself consists of several datasets scanning is applied first to obtain stereo re- that had been continuously developed since 2002. In construction. (Scharstein and Szeliski 2003). the first period, they generated corresponding feature The optical flow is computed from ground truth points of real-world objects [23]. The first Middle- stereo data. bury dataset can be used for the comparison of fea- ture matchers. 
Later on, this stereo database was ex- The main limitation of the Middlebury optical tended with novel datasets using structured-light [24] flow database is that the objects move approximately or conditional random fields [18]. Even subpixel ac- linearly, there is no rotating object in the datasets. curacy can be achieved in this way as it is discussed This is a very strict limitation as tracking is a chal- in [22]. lenging task mainly when the same texture is seen However, the stereo setup is too strict limitation from different viewpoint. for us, our goal is to obtain tracking data via multiple It is interesting that the Middlebury multi-view frames. database [25] contains ground truth 3D reconstruc- 1http://vision.middlebury.edu/ tion of two objects, however, the ground truth track- ing data were not generated for these sequences. An- • Poster. The last sequence of our dataset is a other limitation of the dataset is that only two low- rotating poster in a page of a motorcycle mag- textured objects are used. azine. It is a relatively easy object for feature It is obvious that tracking data can also be gen- matchers since it is a well-textured plane. The erated by a depth camera [26] such as Microsoft pure efficiency of the trackers can be checked Kinect, but its accuracy is very limited. There are in this example due to two reasons: (i) there is other interesting GT generators for planar objects no occlusion, and (ii) the GT feature tracking is such as the work proposed in [8], however, we would equivalent to the determination of plane-plane like to obtain the tracked feature points of real spatial homographies. objects. Due to these limitations, we decided to build a spe- cial hardware in order to generate ground truth data. Our approach is based on a turntable, a camera, and a projector. They are not too costly, but the whole setup is very accurate as it is shown in our accepted paper [19]. 2. Datasets We have generated four GT datasets as it is pub- lished in our mentioned paper [19]. The feature points are always selected by the tested feature gen- erator method in all frames and then these feature locations are matched between the frames. Then the matched point are filtered: the fundamental ma- trix [9] is robustly computed using 8-point method with RANSAC for every image pair and the outliers are removed from the results. The method imple- mented in the OpenCV framework is used for this robustification. Examples for the moving GT feature points of the generated sets are visualized in Figures 1– 4. Point locations are visualized by light blue dots. The feature matchers are tested in four data se- quences: • Dinosaur. A typical computer vision study deals with the reconstruction of a dinosaurs as it is shown in several scientific papers, e.g [6]. Figure 5. Reconstructed 3D model of testing objects. Top: Plush Dog. Center: Dinosaur. Bottom: Flacon. It has a simple diffuse surface that is easy to re- construct in 3D, hence the feature matching is possible. For this reason, a dino is inserted to our testing dataset. 2.1. GT Data Generation Firstly, the possibilities is overviewed that • Flacon. The plastic holder is another smooth OpenCV can give about feature tracking. These are and diffuse surface. A well-textured label is the currently supported feature detectors in OpenCV fixed on the surface. AGAST [13], AKAZE [17], BRISK [10], FAST [20], • Plush Dog. 
The tracking of the feature point GFTT [28] (Good Features To Track – also known of a soft toy is a challenging task as it does not as Shi-Tomasi corners), KAZE [2], MSER [14], have a flat surface. A plush dog is included into ORB [21]. the testing database that is a real challenge for However, if you compile the contrib(nonfree) feature trackers. repository with the OpenCV, you can also get the Figure 1. GT moving feature points of sequence ’Flacon’. Figure 2. GT moving feature points of sequence ’Poster’. following detectors: SIFT [12], STAR [1], and the matching is started. Every image pair is taken SURF [4]. into consideration, and match each feature point in We use our scanner to take 20 images about a the first image with one in the second image. This rotating object. After each image taken, a struc- means that every feature point in the first image will tured light sequence is projected in order to make have a pair in the second one. However, there can be the reconstruction available for every position. (re- some feature locations in the second image, which constructing only the points in the first image is not has more corresponding feature points in the first enough.) one, but it is also possible that there is no matching Then we start searching for features in these im- point. ages using all feature detectors. After the detection The matching itself is done by calculating the is completed, it is required to extract descriptors. De- minimum distances between the descriptor vectors. scriptors are needed for matching the feature points This distance is defined by the feature tracking in different frames. The following descriptors are method used. The following matchers are available used (each can be found in OpenCV): AKAZE [17], in OpenCV: BRISK [10], KAZE [2], ORB [21]. If one compiles the contrib repository, he/she can also get SIFT [12], • L2 – BruteForce: a brute force minimization al- SURF [4], BRIEF [5], FREAK [16], LATCH [11], gorithm that computes each possible matches. DAISY [27] descriptors 2. The error is the L2 norm of the difference be- Another important issue is the parameterization of tween feature descriptors. the feature trackers. It is obvious that the most ac- • L1 – BruteForce: It is the same as L2 – Brute- curate strategy is to find the best system parameters Force, but L1 norm is used instead of L2 one. for the methods, nevertheless the optimal parameters can differ for each testing video. On the other hand, • Hamming – BruteForce: For binary fea- we think that the authors of the tested methods can ture descriptor (BRISK, BRIEF, FREAK, set the parameters more accurately than us as they LETCH,ORB,AKAZE), the Hamming distance are interested in good performance. For this reason, is used. the default parameter setting is used for each method, and we plan to make the dataset available for every- • Hamming2 – BruteForce: A variant of the ham- one and then the authors themselves can parameter- ming distance is used. The difference between ize their methods. Hamming and Hamming2 is that the former After the detection and the extraction are done, considers every bit as element of the vector, while Hamming2 use integer number, each bit 2The BRIEF descriptor is not invariant to rotation, however, pair forms a number from interval 0 . . . 3 3. we hold it in the set of testing algorithms as it surprisingly served good results. 3OpenCV’s documentation is not very informative about Figure 3. GT moving feature points of sequence ’Dinosaur’. Figure 4. 
GT moving feature points of sequence ’Plush Dog’. • Flann-Based: FLANN (Fast Library for Ap- proximate Nearest Neighbors) is a set of al- gorithms optimized for fast nearest neighbor search in large datasets and for high dimen- sional features [15]. It is needed to point out that one can pair each fea- ture detector with each feature descriptor but each feature matchers is not applicable for every descrip- tor. An exception is thrown by OpenCV if the se- lected algorithms cannot work together. But we try to evaluate every possible selection. The comparison of the feature tracker predictions with the ground truth data is as follows: The feature points are reconstructed first in 3D using the images Figure 6. Error measurement based on simple Euclidean and the structured light. Then, because it is known distances. that the turntable was rotated by 3 degrees per im- ages, the projections of the points are calculated for all the remaining images. These projections were However, this comparison is not good enough be- compared to the matched point locations of the fea- cause if a method fails to match correctly the feature ture trackers and the L points in an image pair, then the feature point moves 2 norm is used to calculate the distances. to an incorrect location in the next image. Therefore, the tracker follows the incorrect location in the re- 3. Evaluation Methodology maining frames and the new matching positions in those images will also be incorrect. The easiest and usual way for comparing the To avoid this effect, a new GT point is generated tracked feature points is to compute the summa at the location of the matched point even if it is an and/or average and/or median of the 2D tracking er- incorrect matching. The GT location of that point rors in each image. This error is defined as the Eu- can be determined in the remaining frames since that clidean distance of the tracked and GT locations. point can be reconstructed in 3D as well using the This methodology is visualized in Fig. 6. structured light scanning, and the novel positions of the new GT point can be determined using the cali- Hamming2 distance. They suggest the usage of that for ORB bration data of the test sequence. features. However, it can be applied for other possible descrip- tors, all possible combinations are tried during our tests. Then the novel matching results are compared to all the previously determined GT points. The ob- is also counted. Furthermore, the average length of tained error values are visualized in Fig. 7. the feature tracks is calculated which shows that in The error of a feature point for the i-th frame is the how many images an average feature point is tracked weighted average of all the errors calculated for that through. feature. For example, there is only one error value for the second frame as the matching error can only 4. Comparison of the methods be compared to the GT location of the feature de- The purpose of this section is to show the main is- tected in the first image. For the third frame, there sues occurred during the testing of the feature match- are two GT locations since GT error generated on ers. Unfortunately, we cannot show to the Reader all both the first (original position) and second (position the charts due to the lack of space. from first matching) image. For the i-th image, i − 1 General remark. The charts in this section show error values are obtained. the error is calculated as different combinations of detectors, descriptors, and the weighted average of those. 
It can be formalized matchers. The method ’xxx:yyy:zzz’ denotes in the as charts that the current method uses the detector ’xxx’, descriptor ’yyy’, and matcher algorithm ’zzz’.
Error_{p_i} = \sum_{n=1}^{i-1}\frac{\|p_i - p'_{i,n}\|_2}{i - n} \quad (1)
4.1. Feature Generation and Filtering using the where Error_{p_i} is the error for the i-th frame, p_i Fundamental Matrix the location of the tested feature detector, while p'_{i,n} The number of the detected feature points is exam- is the GT location of the feature points reconstructed ined first. It is an important property of the matcher from the n-th frame. The weights of the distances are algorithms since many good points are required for a 1/(i − n), which means that older GT points have less typical computer vision application. For example, at weight. Remark that the Euclidean (L2) norm is least hundreds of points are required to compute 3D chosen in order to measure the pixel distances. reconstruction of the observed scene. The matched If a feature point is only detected in one image and filtered values are calculated as the average of and was not being followed in the next one (or was the numbers of generated features for all the frames filtered out in the fundamental-matrix-based filtering as features can be independently generated in each step), then that point is discarded. image of the test sequences. Tables 1–4 show the number of the generated features (left) and that of the filtered ones. There are a few interesting behaviors within the data: • The best images for feature tracking are ob- tained when the poster is rotated. The feature generators give significantly the most points in this case. It is a more challenging task to find good feature points for the rotating dog and di- nosaur. It is because the area of these objects in the images is smaller than that of the other two ones (flacon and poster). • It is clearly seen that the number of SURF feature points is the highest in all test cases after out- Figure 7. Applied error measurement. lier removal. This fact suggests that they will be the more accurate features. After the pixel errors are evaluated for each point in all possible images, the minimum, maximum, • The MSER method gives the most number of summa, average, and median error values of every feature points, however, more than 90% of those feature points are calculated per image. The num- are filtered. Unfortunately, the OpenCV3 li- brary does not contain a sophisticated matcher for ber of tracked feature points in the processed image
Table 1. Average of generated feature points and inliers of Sequence ’Plush Dog’.
Detector  #Features  #Inliers
BRISK  21.7  16.9
FAST  19.65  9.48
GFTT  1000  38.16
KAZE  68.6  40.76
MSER  5321.1  10.56
ORB  42.25  34.12
SIFT  67.7  42.8
STAR  7.15  5.97
SURF  514.05  326.02
AGAST  22.45  11.83
AKAZE  144  101.68
Table 3. Average of generated feature points and inliers of Sequence ’Flacon’.
Detector  #Features  #Inliers
BRISK  219.7  160.99
FAST  387.05  275.4
GFTT  1000  593.4
KAZE  484.1  387.93
MSER  3664.1  31.72
ORB  337.65  287.49
SIFT  348.15  260.91
STAR  69.1  54.86
SURF  952.95  726.83
AGAST  410.15  303.45
AKAZE  655  553.11
Table 2. Average of generated feature points and inliers of Sequence ’Poster’.
Table 4. Average of generated feature points and inliers of Sequence ’Dinosaur’.
4. Comparison of the methods

The purpose of this section is to show the main issues that occurred during the testing of the feature matchers. Unfortunately, we cannot show all the charts to the reader due to the lack of space.

General remark. The charts in this section show different combinations of detectors, descriptors, and matchers. The label 'xxx:yyy:zzz' in the charts denotes that the current method uses detector 'xxx', descriptor 'yyy', and matcher algorithm 'zzz'.

4.1. Feature Generation and Filtering using the Fundamental Matrix

The number of detected feature points is examined first. It is an important property of the matcher algorithms, since many good points are required for a typical computer vision application; for example, at least hundreds of points are required to compute a 3D reconstruction of the observed scene. The matched and filtered values are calculated as the average of the numbers of generated features over all frames, as features can be generated independently in each image of the test sequences. Tables 1–4 show the number of generated features and that of the filtered (inlier) ones.

Table 1. Average of generated feature points and inliers of sequence 'Plush Dog'.
Detector   #Features   #Inliers
BRISK      21.7        16.9
FAST       19.65       9.48
GFTT       1000        38.16
KAZE       68.6        40.76
MSER       5321.1      10.56
ORB        42.25       34.12
SIFT       67.7        42.8
STAR       7.15        5.97
SURF       514.05      326.02
AGAST      22.45       11.83
AKAZE      144         101.68

Table 2. Average of generated feature points and inliers of sequence 'Poster'.
Detector   #Features   #Inliers
BRISK      233.55      188.79
FAST       224.75      139.22
GFTT       956.65      618.75
KAZE       573.45      469.18
MSER       4863.6      40.29
ORB        259.5       230.76
SIFT       413.35      343.08
STAR       41.25       35.22
SURF       1876.95     1577.73
AGAST      275.75      200.25
AKAZE      815         761.4

Table 3. Average of generated feature points and inliers of sequence 'Flacon'.
Detector   #Features   #Inliers
BRISK      219.7       160.99
FAST       387.05      275.4
GFTT       1000        593.4
KAZE       484.1       387.93
MSER       3664.1      31.72
ORB        337.65      287.49
SIFT       348.15      260.91
STAR       69.1        54.86
SURF       952.95      726.83
AGAST      410.15      303.45
AKAZE      655         553.11

Table 4. Average of generated feature points and inliers of sequence 'Dinosaur'.
Detector   #Features   #Inliers
BRISK      21.55       14.8
FAST       51.05       27.01
GFTT       1000        92
KAZE       58.55       33.92
MSER       5144.4      17.86
ORB        67.1        45.87
SIFT       52.8        34.96
STAR       3.45        3.45
SURF       276.95      132.61
AGAST      55          29.86
AKAZE      89.1        59.2

There are a few interesting behaviors within the data:

• The best images for feature tracking are obtained when the poster is rotated: the feature generators give by far the most points in this case. It is a more challenging task to find good feature points for the rotating dog and dinosaur, because the area of these objects in the images is smaller than that of the other two objects (flacon and poster).

• It is clearly seen that the number of SURF feature points is the highest in all test cases after outlier removal. This fact suggests that they will be the most accurate features.

• The MSER method gives the largest number of feature points; however, more than 90% of those are filtered out. Unfortunately, the OpenCV3 library does not contain sophisticated matchers for MSER such as [7], therefore its accuracy is relatively low.

• Note that the GFTT algorithm usually gives 1000 points, as the maximum number was set to one thousand for this method. It is an OpenCV parameter that may be changed, but we did not modify this value.
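The fundamental-matrix-based inlier filtering behind the right-hand columns of the tables can be reproduced with OpenCV's RANSAC estimator; a minimal sketch with placeholder correspondences (the threshold and confidence values are illustrative, not necessarily those used in our tests):

```python
import cv2
import numpy as np

# pts1, pts2 would be the matched point locations from the matching step; here: placeholders
rng = np.random.default_rng(1)
pts1 = (rng.random((200, 2)) * [640, 480]).astype(np.float32)
pts2 = pts1 + rng.normal(scale=0.5, size=pts1.shape).astype(np.float32)

F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
if F is None:
    raise SystemExit("fundamental matrix estimation failed")
inliers = int(mask.sum())  # mask marks correspondences consistent with the epipolar geometry
print("kept", inliers, "of", len(pts1), "matches after epipolar filtering")
```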
4.2. Matching accuracy

Two comparisons were carried out for the feature tracker methods. In the first test, every possible combination of the feature detectors and descriptors is examined, while in the second test the detectors are only combined with their own descriptor.

It is important to note that not only the errors of the feature trackers should be compared; we must also pay attention to the number of features in the images and to the length of the feature tracks. A method with fewer detected features usually obtains better results (a lower error rate) than methods with a higher number of features. The most frequently used chart is the AVG-MED one, where the average and the median of the errors are shown.

Testing of all possible algorithms. As can be seen in Fig. 8 (sequence 'Plush Dog'), the SURF method dominates the chart. With the SURF, DAISY, BRIEF, and BRISK descriptors, more than 300 feature points remained and the median values of the errors are below 2.5 pixels, while the average is around 5 pixels. Moreover, the points are tracked through 4 images on average, which yields pretty impressive statistics for the SURF detector.

Figure 8. Average and median errors of top 10 methods for sequence 'Plush Dog'.

The next test object was the 'Poster'. The results are visualized in Fig. 9. It is interesting to note that if the trackers are sorted by the number of outliers and the top 10 methods are plotted, only the AKAZE detector remains, for which more than 90 percent of the feature points were considered as inliers. Besides the high number of points, the average pixel error is between 3 and 5 pixels, depending on the descriptor and matcher type.

Figure 9. Average and median errors of top 10 methods for sequence 'Poster'.

In the test where the 'Flacon' object was used, we got similar results as in the case of the 'Poster'. Both objects are rich in features, but the 'Flacon' is a spatial object. However, if we look at Fig. 10, where the methods with the 10 lowest median values are plotted, one can see that KAZE and SIFT had more feature points and can track them over more pictures than MSER or SURF after the fundamental filtering. Even though they had the lowest median values, the average errors of these methods were rather high. However, if one takes a look at the methods with the lowest average error, one can observe that AKAZE, KAZE and SURF are present in the top 10. These methods can track more points than the previous ones, and the median errors are just around 2.0 pixels.

Figure 10. Top 10 methods with the lowest median for sequence 'Flacon'. Charts are sorted by median (top) and average (bottom) values.

For the sequence 'Dinosaur' (Figure 11), the test object is very dark, which makes feature detection hard. The number of available points is slightly more than 100. In this case, the overall winner among the methods is SURF, with both the lowest average and median errors. However, GFTT is also present in the last chart.

Figure 11. Top 10 methods (with lowest average error) on sequence 'Dinosaur'.

In the comparisons above, only the detectors were compared against each other. As one can see in the charts, most of the methods used either the DAISY, BRIEF, BRISK or SURF descriptor. From the perspective of the matchers, it does not really matter which type of matcher is used for the same detector-descriptor pair. However, if the descriptor gives a binary vector, then the Hamming distance obviously outperforms the L2 or L1 distances, but there are only slight differences between the L1-L2 and H1-H2 distances.

Testing of algorithms with the same detector and descriptor. In this comparison, only the detectors that have their own descriptor are tested. The best matcher is always selected, for which the error is minimal for the observed detector/descriptor. As can be seen in the log-scale charts in Fig. 12, the median error is almost the same for the AKAZE, KAZE, ORB and SURF trackers, but SURF attains the lowest average value. The tests 'Flacon' and 'Poster' result in the lower pixel errors. On the other hand, the rotation of the 'Dinosaur' was the hardest to track; it resulted in much higher errors for all trackers compared to the other tests.

Figure 12. Overall average (top) and median (bottom) error values for all trackers and test sequences. The detectors and descriptors were the same.

5. Conclusions, Limitations, and Future Work

We quantitatively compared the well-known feature detectors, descriptors, and matchers implemented in OpenCV3 in this study. The GT datasets were generated by a structured-light scanner, and the four test objects were rotated by the turntable of our equipment. It seems clear that the most accurate feature for matching is SURF [4], proposed by Bay et al.; it outperforms the other algorithms in all test cases. The other very accurate algorithms are KAZE [2] and AKAZE [17]; they are the runners-up in our competition.

The most important conclusion for us is that such a comparison is a very hard task: for example, there is an infinite number of possible error metrics, the quality is strongly influenced by the number of features, and so on. The main limitation here is that we can only test the methods on images of rotating objects. We are not sure that the same performance would be obtained if translating objects were observed. A possible extension of this paper is to compare the same methods on the Middlebury database and unify the obtained results for rotation and translation.

We hope that this paper is just the very first step of our research. We plan to generate more testing data, and more algorithms will also be involved in the tests. The GT dataset will be made available online, and an open-source testing system is also planned to be released soon (see http://web.eee.sztaki.hu).

References

[1] M. Agrawal and K. Konolige. CenSurE: Center surround extremas for realtime feature detection and matching. In ECCV, 2008.
[2] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In ECCV (6), pages 214–227, 2012.
[3] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, 2011.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
[5] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In Proceedings of the 11th European Conference on Computer Vision: Part IV, pages 778–792, 2010.
[6] A. W. Fitzgibbon, G. Cross, and A. Zisserman. Automatic 3D model construction for turn-table sequences. In 3D Structure from Multiple Images of Large-Scale Environments, LNCS 1506, pages 155–170, 1998.
[7] P.-E. Forssén and D. G. Lowe. Shape descriptors for maximally stable extremal regions. In ICCV. IEEE, 2007.
[8] S. Gauglitz, T. Höllerer, and M. Turk. Evaluation of interest point detectors and feature descriptors for visual tracking. International Journal of Computer Vision, 94(3):335–360, 2011.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[10] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 2548–2555, 2011.
[11] G. Levi and T. Hassner. LATCH: Learned arrangements of three patch codes. CoRR, 2015.
[12] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, ICCV '99, pages 1150–1157, 1999.
[13] E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger. Adaptive and generic corner detection based on the accelerated segment test. In Proceedings of the 11th European Conference on Computer Vision: Part II, pages 183–196, 2010.
[14] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. BMVC, pages 36.1–36.10, 2002.
[15] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, pages 331–340, 2009.
[16] R. Ortiz. FREAK: Fast retina keypoint. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 510–517, 2012.
[17] P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion for accelerated features in nonlinear scale spaces. In Proceedings of the British Machine Vision Conference. BMVA Press, 2013.
[18] C. J. Pal, J. J. Weinman, L. C. Tran, and D. Scharstein. On learning conditional random fields for stereo - exploring model structures and approximate inference. International Journal of Computer Vision, 99(3):319–337, 2012.
[19] Z. Pusztai and L. Hajder. A turntable-based approach for ground truth tracking data generation. In VISAPP 2016, pages 498–509, 2016.
[20] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In International Conference on Computer Vision, pages 1508–1515, 2005.
[21] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In International Conference on Computer Vision, 2011.
[22] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition - 36th German Conference, GCPR 2014, Münster, Germany, September 2-5, 2014, Proceedings, pages 31–42, 2014.
[23] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002.
[24] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In CVPR (1), pages 195–202, 2003.
[25] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), pages 519–528, 2006.
[26] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012.
[27] E. Tola, V. Lepetit, and P. Fua. DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), 2010.
[28] C. Tomasi and J. Shi. Good features to track. In IEEE Conf. Computer Vision and Pattern Recognition, pages 593–600, 1994.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Real-Time Eye Blink Detection using Facial Landmarks

Tereza Soukupová and Jan Čech
Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University in Prague
{soukuter,cechj}@cmp.felk.cvut.cz

Abstract. A real-time algorithm to detect eye blinks in a video sequence from a standard camera is proposed. Recent landmark detectors, trained on in-the-wild datasets, exhibit excellent robustness against head orientation with respect to the camera, varying illumination and facial expressions. We show that the landmarks are detected precisely enough to reliably estimate the level of the eye opening. The proposed algorithm therefore estimates the landmark positions, extracts a single scalar quantity – eye aspect ratio (EAR) – characterizing the eye opening in each frame. Finally, an SVM classifier detects eye blinks as a pattern of EAR values in a short temporal window. The simple algorithm outperforms the state-of-the-art results on two standard datasets.

Figure 1: Open and closed eyes with landmarks p_i automatically detected by [1]. The eye aspect ratio EAR in Eq. (1) plotted for several frames of a video sequence. A single blink is present.

1. Introduction

Detecting eye blinks is important for instance in systems that monitor a human operator's vigilance, e.g. driver drowsiness [5, 13], in systems that warn a computer user staring at the screen without blinking for a long time, to prevent dry eye and computer vision syndrome [17, 7, 8], in human-computer interfaces that ease communication for disabled people [15], or for anti-spoofing protection in face recognition systems [11].
Existing methods are either active or passive. Active methods are reliable but use special hardware that is often expensive and intrusive, e.g. infrared cameras and illuminators [2], or wearable devices such as glasses with special close-up cameras observing the eyes [10]. The passive systems, in contrast, rely on a standard remote camera only.

Many methods have been proposed to automatically detect eye blinks in a video sequence. Several methods are based on motion estimation in the eye region. Typically, the face and eyes are detected by a Viola-Jones type detector. Next, motion in the eye area is estimated from optical flow, by sparse tracking [7, 8], or by frame-to-frame intensity differencing and adaptive thresholding. Finally, a decision is made whether the eyes are or are not covered by eyelids [9, 15]. A different approach is to infer the state of the eye opening from a single image, e.g. by correlation matching with open and closed eye templates [4], a heuristic horizontal or vertical image intensity projection over the eye region [5, 6], a parametric model fitting to find the eyelids [18], or active shape models [14].

A major drawback of the previous approaches is that they usually implicitly impose too strong requirements on the setup, in the sense of the relative face-camera pose (head orientation), image resolution, illumination, motion dynamics, etc. Especially the heuristic methods that use raw image intensity are likely to be very sensitive despite their real-time performance.

Nowadays, however, robust real-time facial landmark detectors that capture most of the characteristic points on a human face image, including eye corners and eyelids, are available; see Fig. 1. Most of the state-of-the-art landmark detectors formulate a regression problem, where a mapping from an image into landmark positions [16] or into another landmark parametrization [1] is learned. These modern landmark detectors are trained on "in-the-wild" datasets and they are thus robust to varying illumination, various facial expressions, and moderate non-frontal head rotations. The average landmark localization error of a state-of-the-art detector is usually below five percent of the inter-ocular distance. Recent methods even run significantly faster than real-time [12].

Therefore, we propose a simple but efficient algorithm to detect eye blinks using a recent facial landmark detector. A single scalar quantity that reflects the level of the eye opening is derived from the landmarks. Finally, having a per-frame sequence of the eye opening estimates, the eye blinks are found by an SVM classifier that is trained on examples of blinking and non-blinking patterns.

Figure 2: Example of detected blinks. The plots of the eye aspect ratio EAR in Eq. (1), results of the EAR thresholding (threshold set to 0.2), the blinks detected by EAR SVM and the ground-truth labels over the video sequence. Input image with detected landmarks (depicted frame is marked by a red line).

The facial segmentation model presented in [14] is similar to the proposed method. However, their system is based on active shape models with a reported processing time of about 5 seconds per frame for the segmentation, and the eye opening signal is normalized by statistics estimated by observing a longer sequence. The system is thus usable for offline processing only. The proposed algorithm runs in real time, since the extra costs of computing the eye opening from the landmarks and of the linear SVM are negligible.

The contributions of the paper are:

1. The ability of two state-of-the-art landmark detectors [1, 16] to reliably distinguish between the open and closed eye states is quantitatively demonstrated on a challenging in-the-wild dataset and for various face image resolutions.

2. A novel real-time eye blink detection algorithm which integrates a landmark detector and a classifier is proposed.
The evaluation is done on two standard datasets [11, 8], achieving state-of-the-art results.

The rest of the paper is structured as follows: the algorithm is detailed in Sec. 2, experimental validation and evaluation are presented in Sec. 3, and Sec. 4 concludes the paper.

2. Proposed method

An eye blink is a fast closing and reopening of a human eye. Each individual has a slightly different pattern of blinks. The pattern differs in the speed of closing and opening, the degree of squeezing the eye, and the blink duration. An eye blink lasts approximately 100-400 ms.

We propose to exploit state-of-the-art facial landmark detectors to localize the eyes and eyelid contours. From the landmarks detected in the image, we derive the eye aspect ratio (EAR) that is used as an estimate of the eye opening state. Since the per-frame EAR may not necessarily recognize the eye blinks correctly, a classifier that takes a larger temporal window of a frame into account is trained.

2.1. Description of features

For every video frame, the eye landmarks are detected. The eye aspect ratio (EAR) between the height and the width of the eye is computed as

EAR = (\|p_2 − p_6\| + \|p_3 − p_5\|) / (2 \|p_1 − p_4\|),   (1)

where p_1, . . . , p_6 are the 2D landmark locations depicted in Fig. 1.

The EAR is mostly constant when an eye is open and gets close to zero while the eye is closing. It is partially person- and head-pose-insensitive. The aspect ratio of the open eye has a small variance among individuals, and it is fully invariant to a uniform scaling of the image and to an in-plane rotation of the face. Since eye blinking is performed by both eyes synchronously, the EAR of both eyes is averaged. An example of an EAR signal over a video sequence is shown in Figs. 1, 2 and 7.

A similar feature to measure the eye opening was suggested in [9], but it was derived from the eye segmentation in a binary image.
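A direct transcription of Eq. (1) in NumPy; the landmark ordering follows Fig. 1 and the sample coordinates are invented:

```python
import numpy as np

def eye_aspect_ratio(p):
    """EAR of one eye (Eq. 1); p is a (6, 2) array of landmarks p1..p6 ordered as in Fig. 1."""
    a = np.linalg.norm(p[1] - p[5])   # ||p2 - p6||
    b = np.linalg.norm(p[2] - p[4])   # ||p3 - p5||
    c = np.linalg.norm(p[0] - p[3])   # ||p1 - p4||
    return (a + b) / (2.0 * c)

# made-up landmark coordinates of a roughly open eye
left_eye = np.array([[0, 0], [2, -1], [4, -1], [6, 0], [4, 1], [2, 1]], dtype=float)
print(eye_aspect_ratio(left_eye))     # the EARs of both eyes would be averaged in practice
```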
2.2. Classification

It generally does not hold that a low value of the EAR means that a person is blinking. A low value of the EAR may occur when a subject closes his/her eyes intentionally for a longer time or performs a facial expression, yawning, etc., or when the EAR captures a short random fluctuation of the landmarks.

Therefore, we propose a classifier that takes a larger temporal window of a frame as an input. For the 30 fps videos, we experimentally found that ±6 frames can have a significant impact on blink detection for a frame where an eye is the most closed when blinking. Thus, for each frame, a 13-dimensional feature is gathered by concatenating the EARs of its ±6 neighboring frames.

This is implemented by a linear SVM classifier (called EAR SVM) trained from manually annotated sequences. Positive examples are collected as ground-truth blinks, while the negatives are sampled from parts of the videos where no blink occurs, with a 5-frame spacing and a 7-frame margin from the ground-truth blinks. When testing, the classifier is executed in a scanning-window fashion: a 13-dimensional feature is computed and classified by the EAR SVM for each frame except the beginning and the end of a video sequence.

3. Experiments

Two types of experiments were carried out: experiments that measure the accuracy of the landmark detectors, see Sec. 3.1, and experiments that evaluate the performance of the whole eye blink detection algorithm, see Sec. 3.2.

3.1. Accuracy of landmark detectors

To evaluate the accuracy of the tested landmark detectors, we used the 300-VW dataset [19]. It is a dataset containing 50 videos where each frame has an associated precise annotation of facial landmarks. The videos are "in-the-wild", mostly recorded from TV.

The purpose of the following tests is to demonstrate that recent landmark detectors are particularly robust and precise in detecting eyes, i.e. the eye corners and the contour of the eyelids. Therefore we prepared a dataset, a subset of the 300-VW, containing sample images with both open and closed eyes. More precisely, having the ground-truth landmark annotation, we sorted the frames for each subject by the eye aspect ratio (EAR in Eq. (1)) and took 10 frames of the highest ratio (eyes wide open), 10 frames of the lowest ratio (mostly eyes tightly shut) and 10 frames sampled randomly. Thus we collected 1500 images. Moreover, all the images were later subsampled (successively 10 times by factor 0.75) in order to evaluate the accuracy of the tested detectors on small face images.

Two state-of-the-art landmark detectors were tested: Chehra [1] and Intraface [16]. Both run in real time (Intraface runs at 50 Hz on a standard laptop). Samples from the dataset are shown in Fig. 3. Notice that the faces are not always frontal to the camera, the expression is not always neutral, people are often emotionally speaking or smiling, etc. Sometimes people wear glasses, and hair may occasionally partially occlude one of the eyes. Both detectors perform generally well, but Intraface is more robust to very small face images, sometimes to an impressive extent, as shown in Fig. 3.

Figure 3: Example images from the 300-VW dataset with landmarks obtained by Chehra [1] and Intraface [16]. Original images (left) with inter-ocular distance (IOD) equal to 63 (top) and 53 (bottom) pixels. Images subsampled (right) to IOD equal to 6.3 (top) and 17 (bottom).

Quantitatively, the accuracy of the landmark detection for a face image is measured by the average relative landmark localization error, defined as usual as

ε = (100 / (κ N)) \sum_{i=1}^{N} \|x_i − x̂_i\|_2,   (2)

where x_i is the ground-truth location of landmark i in the image, x̂_i is the landmark location estimated by a detector, N is the number of landmarks, and the normalization factor κ is the inter-ocular distance (IOD), i.e. the Euclidean distance between the eye centers in the image.
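Eq. (2) translates into a few lines of NumPy; the toy arrays below are placeholders, and κ is passed in explicitly since which landmarks define the eye centers depends on the annotation scheme:

```python
import numpy as np

def landmark_error(x_gt, x_est, iod):
    """Average relative landmark localization error of Eq. (2), in percent of the IOD.
    x_gt, x_est: (N, 2) arrays of annotated and detected landmarks; iod: kappa."""
    per_landmark = np.linalg.norm(x_gt - x_est, axis=1)
    return 100.0 * per_landmark.mean() / iod

rng = np.random.default_rng(0)
gt = rng.random((49, 2)) * 200            # 49 made-up landmark positions
est = gt + rng.normal(scale=2.0, size=gt.shape)
print(landmark_error(gt, est, iod=80.0))  # error in percent of the inter-ocular distance
```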
First, a standard cumulative histogram of the average relative landmark localization error was calculated, see Fig. 4, for the complete set of 49 landmarks and also for the subset of the 12 landmarks of the eyes only, since these landmarks are used in the proposed eye blink detector. The results are calculated for all the original images, which have an average IOD of about 80 px, and also for all "small" face images (including the subsampled ones) having IOD ≤ 50 px. For all landmarks, Chehra has more occurrences of very small errors (up to 5 percent of the IOD), but Intraface is more robust, having more occurrences of errors below 10 percent of the IOD. For the eye landmarks only, Intraface is always more precise than Chehra. As already mentioned, Intraface is much more robust to small images than Chehra. This behaviour is further observed in the following experiment.

Figure 4: Cumulative histogram of the average localization error of all 49 landmarks (top) and of the 12 landmarks of the eyes (bottom). The histograms are computed for original resolution images (solid lines) and for a subset of small images (IOD ≤ 50 px).

Taking the set of all 15k images, we measured the mean localization error µ as a function of the face image resolution determined by the IOD. More precisely, µ = (1/|S|) \sum_{j∈S} ε_j, i.e. the average error over the set S of face images having the IOD in a given range. The results are shown in Fig. 5. The plots have error bars of standard deviation. It is seen that Chehra fails quickly for images with IOD < 20 px. For larger faces, the mean error is comparable, although slightly better for Intraface for the eye landmarks.

Figure 5: Landmark localization accuracy as a function of the face image resolution, computed for all landmarks and for eye landmarks only.

The last test is directly related to the eye blink detector. We measured the accuracy of the EAR as a function of the IOD. The mean EAR error is defined as the mean absolute difference between the true and the estimated EAR. The plots are computed for two subsets: closed/closing eyes (average true ratio 0.05 ± 0.05) and open eyes (average true ratio 0.4 ± 0.1). The error is higher for closed eyes. The reason is probably that both detectors are more likely to output open eyes in case of a failure. It is seen that the ratio error for IOD < 20 px causes a major confusion between open/closed eye states for Chehra; nevertheless, for larger faces the ratio is estimated precisely enough to ensure reliable eye blink detection.

Figure 6: Accuracy of the eye-opening ratio as a function of the face image resolution. Top: images with a small true ratio (mostly closing/closed eyes); bottom: images with a higher ratio (open eyes).

3.2. Eye blink detector evaluation

We evaluate on two standard databases with ground-truth annotations of blinks. The first one is ZJU [11], consisting of 80 short videos of 20 subjects. Each subject has 4 videos: 2 with and 2 without glasses; 3 videos are frontal and 1 is an upward view. The 30 fps videos are of size 320 × 240 px. The average video length is 136 frames, and a video contains about 3.6 blinks on average. The average IOD is 57.4 pixels. In this database, subjects do not perform any noticeable facial expressions. They look straight into the camera at close distance, almost do not move, and neither smile nor speak. A ground-truth blink is defined by its beginning frame, peak frame and ending frame.

The second database, Eyeblink8 [8], is more challenging. It consists of 8 long videos of 4 subjects that are smiling, rotating the head naturally, covering the face with hands, yawning, drinking and looking down, probably at a keyboard. These videos have lengths from 5k to 11k frames, are also 30 fps, with a resolution of 640 × 480 pixels and an average IOD of 62.9 pixels. They contain about 50 blinks on average per video. Each frame belonging to a blink is annotated with the half-open or closed state of the eyes. We consider half blinks, which do not reach the closed state, as full blinks in order to be consistent with the ZJU.
Besides testing the proposed EAR SVM methods, which are trained to detect the specific blink pattern, we compare with a simple baseline method which only thresholds the EAR values of Eq. (1). The EAR SVM classifiers are tested with both landmark detectors, Chehra [1] and Intraface [16].

The experiments with the EAR SVM are done in a cross-dataset fashion: the SVM classifier is trained on Eyeblink8 and tested on the ZJU, and vice versa.

To evaluate detector accuracy, the predicted blinks are compared with the ground-truth blinks. The number of true positives is determined as the number of ground-truth blinks which have a non-empty intersection with the detected blinks. The number of false negatives is counted as the number of ground-truth blinks which do not intersect any detected blink. The number of false positives is equal to the number of detected blinks minus the number of true positives, plus a penalty for detecting too long blinks. The penalty is counted only for detected blinks more than twice as long as an average blink of length A: every such long blink of length L is counted L/A times as a false positive. The number of all possibly detectable blinks is computed as the number of frames of a video sequence divided by the subject's average blink length, following Drutarovsky and Fogelton [8].
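A sketch of this evaluation protocol as we read it (intervals are frame ranges; the handling of the long-blink penalty is our interpretation of the text, not the authors' code):

```python
def evaluate_blinks(gt_blinks, det_blinks, avg_blink_len):
    """Count TP/FP/FN between ground-truth and detected blinks.
    Blinks are (first_frame, last_frame) tuples; avg_blink_len is A in the text."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    tp = sum(1 for g in gt_blinks if any(overlaps(g, d) for d in det_blinks))
    fn = len(gt_blinks) - tp

    fp = len(det_blinks) - tp
    for d in det_blinks:
        length = d[1] - d[0] + 1
        if length > 2 * avg_blink_len:        # penalty only for overly long detections
            fp += length / avg_blink_len      # such a blink is counted L/A times as a false positive
    return tp, fp, fn

print(evaluate_blinks([(10, 15), (40, 46)], [(11, 14), (60, 95)], avg_blink_len=7))
```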
This con- of true positives is determined as a number of the firms our previous study on the accuracy of land- ground-truth blinks which have a non-empty inter- marks in Sec. 3.1. 100 1 A B 90 0.9 C 0.8 80 0.7 70 0.6 60 0.5 AUC 50 0.4 0.3 40 Precision [%] 0.2 30 0.1 Chehra SVM Intraface SVM 20 0 57.38 51.6 45.9 40.2 34.4 28.7 23.0 17.2 11.5 5.7 EAR Thresholding IOD [px] 10 Chehra SVM Intraface SVM 0 Figure 9: Accuracy of the eye blink detector (mea- 0 10 20 30 40 50 60 70 80 90 100 Recall [%] sured by AUC) as a function of the image resolution (average IOD) when subsampling the ZJU dataset. (a) ZJU 100 phenomena as non-frontality, bad illumination, facial 90 expressions, etc. 80 A State-of-the-art on two standard datasets was 70 achieved using the robust landmark detector fol- lowed by a simple eye blink detection based on the 60 SVM. The algorithm runs in real-time, since the ad- 50 ditional computational costs for the eye blink detec- 40 Precision [%] tion are negligible besides the real-time landmark de- tectors. 30 The proposed SVM method that uses a temporal 20 window of the eye aspect ratio (EAR), outperforms EAR Thresholding 10 the EAR thresholding. On the other hand, the thresh- Chehra SVM Intraface SVM olding is usable as a single image classifier to detect 0 0 10 20 30 40 50 60 70 80 90 100 the eye state, in case that a longer sequence is not Recall [%] available. (b) Eyeblink8 We see a limitation that a fixed blink duration for all subjects was assumed, although everyone’s blink Figure 8: Precision-recall curves of the EAR thresh- lasts differently. The results could be improved by an olding and EAR SVM classifiers measured on (a) the adaptive approach. Another limitation is in the eye ZJU and (b) the Eyeblink8 databases. Published re- opening estimate. While EAR is estimated from a 2D sults of methods A - Drutarovsky and Fogelton [8], B image, it is fairly insensitive to a head orientation, but - Lee et al. [9], C - Danisman et al. [5] are depicted. may lose discriminability for out of plane rotations. A solution might be to define the EAR in 3D. There 4. Conclusion are landmark detectors that estimate a 3D pose (po- sition and orientation) of a 3D model of landmarks, A real-time eye blink detection algorithm was e.g. [1, 3]. presented. We quantitatively demonstrated that regression-based facial landmark detectors are pre- Acknowledgment cise enough to reliably estimate a level of eye open- ness. While they are robust to low image quality (low The research was supported by CTU student grant image resolution in a large extent) and in-the-wild SGS15/155/OHK3/2T/13. References [16] X. Xiong and F. De la Torre. Supervised descent methods and its applications to face alignment. In [1] A. Asthana, S. Zafeoriou, S. Cheng, and M. Pantic. Proc. CVPR, 2013. 2, 3, 4, 5 Incremental face alignment in the wild. In Confer- [17] Z. Yan, L. Hu, H. Chen, and F. Lu. Computer vision ence on Computer Vision and Pattern Recognition, syndrome: A widely spreading but largely unknown 2014. 1, 2, 3, 4, 5, 7 epidemic among computer users. Computers in Hu- [2] L. M. Bergasa, J. Nuevo, M. A. Sotelo, and man Behaviour, (24):2026–2042, 2008. 1 M. Vazquez. Real-time system for monitoring driver [18] F. Yang, X. Yu, J. Huang, P. Yang, and D. Metaxas. vigilance. In IEEE Intelligent Vehicles Symposium, Robust eyelid tracking for fatigue detection. In ICIP, 2004. 1 2012. 1 [3] J. Cech, V. Franc, and J. Matas. A 3D approach to facial landmarks: Detection, refinement, and track- [19] S. Zafeiriou, G. 
[4] M. Chau and M. Betke. Real time eye tracking and blink detection with USB cameras. Technical Report 2005-12, Boston University Computer Science, May 2005.
[5] T. Danisman, I. Bilasco, C. Djeraba, and N. Ihaddadene. Drowsy driver detection system using eye blink patterns. In Machine and Web Intelligence (ICMWI), Oct 2010.
[6] H. Dinh, E. Jovanov, and R. Adhami. Eye blink detection using intensity vertical projection. In International Multi-Conference on Engineering and Technological Innovation, IMETI 2012.
[7] M. Divjak and H. Bischof. Eye blink based fatigue detection for prevention of computer vision syndrome. In IAPR Conference on Machine Vision Applications, 2009.
[8] T. Drutarovsky and A. Fogelton. Eye blink detection using variance of motion vectors. In Computer Vision - ECCV Workshops, 2014.
[9] W. H. Lee, E. C. Lee, and K. E. Park. Blink detection robust to various facial poses. Journal of Neuroscience Methods, Nov. 2010.
[10] Medicton group. The system I4Control. http://www.i4tracking.cz/.
[11] G. Pan, L. Sun, Z. Wu, and S. Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, 2007.
[12] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proc. CVPR, 2014.
[13] A. Sahayadhas, K. Sundaraj, and M. Murugappan. Detecting driver drowsiness based on sensors: A review. MDPI open access: sensors, 2012.
[14] F. M. Sukno, S.-K. Pavani, C. Butakoff, and A. F. Frangi. Automatic assessment of eye blinking patterns through statistical shape models. In ICVS, 2009.
[15] D. Torricelli, M. Goffredo, S. Conforto, and M. Schmid. An adaptive blink detector to initialize and update a view-based remote eye gaze tracking system in a natural scenario. Pattern Recognition Letters, 30(12):1144–1150, Sept. 2009.
[16] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proc. CVPR, 2013.
[17] Z. Yan, L. Hu, H. Chen, and F. Lu. Computer vision syndrome: A widely spreading but largely unknown epidemic among computer users. Computers in Human Behaviour, (24):2026–2042, 2008.
[18] F. Yang, X. Yu, J. Huang, P. Yang, and D. Metaxas. Robust eyelid tracking for fatigue detection. In ICIP, 2012.
[19] S. Zafeiriou, G. Tzimiropoulos, and M. Pantic. The 300 videos in the wild (300-VW) facial landmark tracking in-the-wild challenge. In ICCV Workshop, 2015. http://ibug.doc.ic.ac.uk/resources/300-VW/.

21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Solving Dense Image Matching in Real-Time using Discrete-Continuous Optimization

Alexander Shekhovtsov, Christian Reinbacher, Gottfried Graber and Thomas Pock
Institute for Computer Graphics and Vision, Graz University of Technology
{shekhovtsov,reinbacher,graber,pock}@icg.tugraz.at

Abstract. Dense image matching is a fundamental low-level problem in Computer Vision, which has received tremendous attention from both discrete and continuous optimization communities. The goal of this paper is to combine the advantages of discrete and continuous optimization in a coherent framework. We devise a model based on energy minimization, to be optimized by both discrete and continuous algorithms in a consistent way. In the discrete setting, we propose a novel optimization algorithm that can be massively parallelized. In the continuous setting we tackle the problem of non-convex regularizers by a formulation based on differences of convex functions. The resulting hybrid discrete-continuous algorithm can be efficiently accelerated by modern GPUs and we demonstrate its real-time performance for the applications of dense stereo matching and optical flow.
1. Introduction

The dense image matching problem is one of the most basic problems in computer vision: the goal is to find matching pixels in two (or more) images. The applications include stereo, optical flow, medical image registration, face recognition [1], etc. Since the matching problem is inherently ill-posed, optimization is typically involved in solving it. We can distinguish two fundamentally different approaches: discrete and continuous optimization. Whereas discrete approaches (see [14] for a recent comparison) assign a distinct label to each output pixel, continuous approaches solve for a function using the calculus of variations [6, 8, 21]. Both approaches have received enormous attention, and there exist state-of-the-art algorithms in both camps: continuous [23, 24, 28] and discrete [18, 30]. Due to the specific mathematical tools available to solve the problems (discrete combinatorial optimization vs. continuous calculus of variations), both approaches have distinct advantages and disadvantages.

Figure 1: Optical flow problem solved by a purely discrete method, a purely continuous method and the combined method. All methods are as described in this paper; they use the same data term and are run until convergence here. In the discrete solution we can see small-scale details and sharp motion boundaries, but also discretization artifacts. The continuous solution exhibits sub-pixel accuracy (smoothness), but lacks small details and has difficulties with large motions. The combined solution delivers smooth flow fields while retaining many small-scale details. Summary table from Figure 1:

             data term              Large motion   Parallelization
Discrete     Arbitrary (sampled)    Easy           Difficult
Continuous   Convex (linearized)    Difficult      Easy

In this paper, we argue that on a fundamental level the advantages and disadvantages of discrete and continuous optimization for dense matching problems are complementary, as summarized in Figure 1. Previous work combining discrete and continuous optimization primarily used discrete optimization to fuse (find the optimal crossover of) candidate continuous proposals, e.g. [36, 30] (stereo) and [25] (flow). The latter additionally performs local continuous optimization of the so-found solution. Many works also alternate between continuous and discrete optimization, addressing a Mumford-Shah-like model, e.g. [5]. Similarly to [25] we introduce a continuous energy which is optimized using a combined method. However, we work with a full (non-local) discretization of this model and propose new parallel optimization methods.

The basic difference between discrete and continuous approaches lies in the handling of the data term. The data term is a measure of how well the solution (i.e. the value of a pixel) fits the underlying measurement (i.e. the input images).
In the discrete setting, the solution takes discrete labels, and hence the number of labels is finite. Typically the data cost is precomputed for all possible labels. The discrete optimization then uses the data cost to find the optimal label for each pixel according to a suitable model in an energy minimization framework. We point out that, due to the sampling in both label space and spatial domain, the discrete algorithm has access to the full information at every step; i.e. it deals with a global optimization model and in some lucky cases can find a globally optimal solution to it, or provide an approximation ratio or partial optimality guarantees [27].

In the continuous setting, the solution is a continuous function. This means it is not possible to precompute the data cost; an infinite number of solutions would require an infinite amount of memory. More importantly, the data cost is a non-convex function stemming from the similarity measure between the images. In order to make the optimization problem tractable, a popular approach is the linearization of the data cost. However, this introduces a range of new problems, namely the inability to deal with large motions, due to the fact that the linearization is valid only in a small neighborhood around the linearization point. Most continuous methods relying on linearization therefore use a coarse-to-fine framework in an attempt to overcome this problem [4]. One exception is a recent work [16], which can handle piecewise linear data terms and truncated TV regularization.

Our goal in this paper is to combine the advantages of both approaches, as well as real-time performance, which imposes tough constraints on both methods, resulting in a number of challenges.

Challenges. The discrete optimization method needs to be highly parallel and able to couple the noisy / ambiguous data over large areas. The continuous energy should be a refinement of the discrete energy, so that we can evaluate the two-phase optimization in terms of a single energy function. The continuous method needs to handle robust (truncated) regularization terms.

Contribution. Towards the posed challenges, we propose: i) a new method for the discrete problem, working in the dual (i.e. making equivalent changes of the data cost volume), in parallel on multiple chains; ii) a continuous optimization method, reducing non-convex regularizers to a primal-dual method with non-linear operators [31]; iii) an efficient implementation of both methods on GPU and proof-of-concept experiments showing the advantages of the combined approach.

2. Method

In this section we describe our two-step approach to the dense image matching problem. To combine the previously discussed advantages of discrete and continuous optimization methods, it is essential to minimize the same energy in both optimization methods. Starting from a continuous energy formulation in § 2.1, we first show how to discretize the energy in § 2.2 and subsequently minimize it using a novel discrete parallel block coordinate descent, described in § 2.3. The output of this algorithm is the input to a refinement method which is posed as a continuous optimization problem, solved by a non-linear primal-dual algorithm described in § 2.4.

2.1. Model

Let us formally define the dense image matching problem to be addressed by the discrete-continuous optimization approach. In both formulations we consider the image domain to be a discrete set of pixels V. The continuous formulation has continuous-ranged variables u = (u_i^k ∈ R | k = 1, . . . , d, i ∈ V), where d = 1, 2 for stereo / flow, respectively. The matching problem is formulated as

min_{u ∈ U} [ E(u) = D(u) + R(Au) ],   (1)

where U = R^{d×V}, D is the data term and R(Au) is a regularizer (A is a linear operator explained below). The discrete formulation will quantize the variable ranges.

Data Term. We assume D(u) = \sum_{i∈V} D_i(u_i), where D_i : R^d → R encodes the deviation of u_i from some underlying measurement. A usual choice for dense image matching are robust filters like the Census Transform or Normalized Cross Correlation, computed on a small window around a pixel. This data term is non-convex in u and piecewise linear. In the discrete setting, the data term is sampled at discrete locations; in the continuous setting, the data term is convexified by linearizing or approximating it around the current solution. The details will be described in the respective sections.
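To make the data term concrete, here is a small NumPy sketch of a census-transform matching cost for a single disparity label; the window size, the wrap-around border handling via np.roll, and the toy image pair are our own illustrative choices, not necessarily those of this paper:

```python
import numpy as np

def census(img, radius=2):
    """Census transform: per pixel, a bit vector comparing each neighbour with the centre."""
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            bits.append((shifted < img).astype(np.uint8))
    return np.stack(bits, axis=-1)                      # (h, w, num_bits)

def census_cost(c1, c2, disparity):
    """Hamming distance between census strings of I1(x) and I2(x - disparity)."""
    shifted = np.roll(c2, disparity, axis=1)
    return np.sum(c1 != shifted, axis=-1)               # (h, w) cost slice for this label

img1 = np.random.rand(60, 80).astype(np.float32)
img2 = np.roll(img1, -3, axis=1)                        # toy pair with a constant 3 px disparity
c1, c2 = census(img1), census(img2)
print(census_cost(c1, c2, 3).mean(), census_cost(c1, c2, 0).mean())  # cost is much lower at the true shift
```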
Regularization Term. The regularizer encodes properties of the solution of the energy minimization, like local smoothness or preservation of sharp edges. The choice of this term is crucial in practice, since the data term may be unreliable or uninformative in large areas of dense matching problems. We assume (truncated) regularization terms of the form

R(Au) = \sum_{ij∈E} \sum_{k=1}^{d} ω_{ij} r((Au^k)_{ij}),   (2)

where E ⊂ V × V is the set of edges, i.e. pairs of neighboring pixels; the linear operator A : R^V → R^E : u^k ↦ (u_i^k − u_j^k ∈ R | ∀ij ∈ E) essentially computes gradients along the edges in E for the solution dimension k; the gradients are penalized by the penalty function r : R → R; and ω_{ij} are image-dependent per-edge strength weights, reducing the penalty around sharp edges. Our particular choice for the penalty function r is depicted in Fig. 2. We chose to use a truncated norm, which has been shown to be robust against the noise one typically encounters in dense matching problems. It generalizes truncated Total Variation in the continuous setting. In the discrete setting it generalizes the P1-P2 penalty model [11], the Potts model and the truncated linear model.

Figure 2: Regularizer function r. In our continuous optimization method it is decomposed into a difference of convex functions r+ − r−. For the discrete optimization it is sampled at label locations depicted as dots.
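For intuition, the truncated penalty and its difference-of-convex split can be evaluated directly from the closed forms given later in Eqs. (19)-(20); a tiny sketch with arbitrary parameter values:

```python
import numpy as np

def r_ab(t, alpha, beta):
    """Piecewise-linear penalty r_{alpha,beta} of Eq. (20); convex for alpha <= 1."""
    t = np.abs(t)
    return np.where(t <= beta, alpha * t, t - beta * (1.0 - alpha))

def r_truncated(t, eps, delta, C):
    """Robust regularizer of Fig. 2 via Eq. (19): r = r_{eps,delta} - r_{0, C+delta-eps*delta}."""
    return r_ab(t, eps, delta) - r_ab(t, 0.0, C + delta - eps * delta)

t = np.linspace(-4, 4, 9)
print(r_truncated(t, eps=0.1, delta=1.0, C=2.0))   # saturates at C for large |t| (truncation)
```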
Stereo. We discretize a range of disparities and let u(x) ∈ R^V denote the continuous solution corresponding to the labeling x. We set f_i(x_i) = D_i(u(x_i)) and f_{ij}(x_i, x_j) = ω_{ij} r((Au(x))_{ij}).

Flow. Discretization of the flow is somewhat more challenging. Since u_i is a 2D vector, assuming large displacements, discretizing all combinations is not tractable. Instead, the components u_i^1 and u_i^2 can be represented as separate discrete variables x_{i1}, x_{i2}, where (i1, i2) is a pair of nodes duplicating i, leading to the decomposed formulation [26]. To retain the pairwise energy form (3), this approach assigns the data terms D_i(u_i) to a pairwise cost f_{i1 i2}(x_{i1}, x_{i2}), and the regularization is imposed on each layer of variables (x_{i1} | i ∈ V) and (x_{i2} | i ∈ V) separately. To this end, we tested a yet simpler representation, in which we assign optimistic data costs, given by

f_{i1}(x_{i1}) = min_{x_{i2}} D_i(x_{i1}, x_{i2}),   (5a)
f_{i2}(x_{i2}) = min_{x_{i1}} D_i(x_{i1}, x_{i2}),   (5b)

where D_i(x_{i1}, x_{i2}) is the discretized data cost, and regularize each layer individually. This makes the two layers fully decouple into, essentially, two independent stereo-like problems. At the same time, the coupled scheme [26], passing messages between the two layers, differs merely in recomputing (5) for reparametrized data costs in a loop. Our simplification is thus not a principled limitation but an intermediate step.

2.2. Discrete Formulation

In the discrete representation we use the following formalism. To a continuous variable u_i we associate a discrete variable x_i ∈ L. The discrete label space L can be chosen to our convenience as long as it has the desired number of elements, denoted K. We let L be the set of vectors in {0, 1}^K with exactly one component equal to 1 (the 1-hot encoding of the natural numbers from 1 to K). For f_i ∈ R^K we denote f_i(x_i) = ⟨f_i, x_i⟩ = f_i^T x_i, and for f_{ij} ∈ R^{K×K} we denote f_{ij}(x_i, x_j) = x_i^T f_{ij} x_j. Let f = (f_w | w ∈ V ∪ E) denote the energy cost vector. The energy function corresponding to the cost vector f is given by

f(x) = \sum_{i∈V} f_i(x_i) + \sum_{ij∈E} f_{ij}(x_i, x_j).   (3)

Whenever we need to refer to f as a function and not as the cost vector, we always use the argument notation; e.g. f(x) ≥ g(x) is different from f ≥ g.

An energy function f that can be written as \sum_i f_i(x_i) = ⟨f, x⟩ is called modular, separable or linear; formally, all components f_{ij} of f are identically zero. If f_{ij} is non-zero only for a subgraph of (V, E) which is a set of chains, we say that f is a chain.

The discrete energy minimization problem is defined as

min_{x ∈ L^V} f(x).   (4)

2.3. Discrete Optimization

In this section we give an overview of a new method under development, addressing problem (4) through its LP-relaxation dual. In real-time applications like stereo and flow there seems to be a demand for methods performing fast approximate discrete optimization, preferably well-parallelizable, which has motivated significant research. The challenge may be phrased as "best solution in a limited time budget".

Well-performing methods, from local to global, range from cost volume filtering [12], semi-global matching (SGM) [11] (which has been implemented on GPU and FPGA [2]), dynamic programming on spanning trees adjusting the cost volume [3] and more-global matching (MGM) [10], to the sequential dual block coordinate methods such as TRW-S [15]. Despite being called sequential, TRW-S exhibits a fair amount of parallelism in its computation dependency graph, which is exploited in parallel GPU/FPGA implementations [7, 13]. At the same time, SGM has been interpreted [9] as a single step of the parallel TRW algorithm [32] developed for solving the dual. MGM goes further in this direction, resembling even more the structure of a dual solver: it combines together more messages, but in a heuristic fashion and introducing more computation dependencies, in fact similar to TRW-S. It appears that all these approaches go somehow in the direction of a fast processing of the dual.

We propose a new dual update scheme which: i) is a monotonous block-coordinate ascent; ii) performs as well as TRW-S for an equal number of iterations while having a comparable iteration cost; and iii) offers more parallelism, mapping better to current massively parallel compute architectures. It thus bridges the gap between highly parallel heuristics and the best "sequential" dual methods without compromising on speed and performance.

On a higher level, the method is most easily presented in the dual decomposition framework. For clarity, let us consider a decomposition into two subproblems only (horizontal and vertical chains). Consider minimizing the energy function E(x) that separates as

E(x) = f(x) + g(x),   (6)

where f, g : L^V → R are chains.
Primal Majorize-Minimize. Even before introducing the dual, we can propose applying the majorize-minimize method (a well-known optimization technique) to the primal problem in the form (6). It is instructive for the subsequent presentation of the dual method and has an intriguing connection to it, which we do not yet fully understand.

Definition 2.1. A modular function \bar{f} is a majorant (upper bound) of f if (∀x) \bar{f}(x) ≥ f(x), symbolically \bar{f} ⪰ f. A modular minorant \underline{f} of f is defined similarly (\underline{f} reads "f underbar").

Noting that minimizing a chain function plus a modular function is easy, one could straightforwardly propose Algorithm 1, which alternates between majorizing one of f or g by a modular function and minimizing the resulting chain problem \bar{f} + g (resp. f + \bar{g}). We are not aware of this approach having been evaluated before. Somewhat novel is that the sum of two chain functions is employed rather than, say, a difference of submodular functions [19], but the principle is the same.

Algorithm 1: Primal MM
  Input: initial primal point x^k
  Output: new primal point x^{k+2}
  1: \bar{f} ⪰ f, \bar{f}(x^k) = f(x^k)              /* majorize */
  2: x^{k+1} ∈ argmin_x (\bar{f} + g)(x)              /* minimize */
  3: \bar{g} ⪰ g, \bar{g}(x^{k+1}) = g(x^{k+1})       /* majorize */
  4: x^{k+2} ∈ argmin_x (f + \bar{g})(x)              /* minimize */

To ensure monotonicity of the algorithm we need to pick a majorant \bar{f} of f which is exact in the current primal solution x^k, as in Line 1. Then f(x^{k+1}) + g(x^{k+1}) ≤ \bar{f}(x^{k+1}) + g(x^{k+1}) ≤ \bar{f}(x^k) + g(x^k) = f(x^k) + g(x^k). Steps 3-4 are completely similar. Algorithm 1 has the following properties:
• it is primal monotonous;
• it is parallel, since, e.g., min_x (\bar{f} + g)(x) decouples over all vertical chains;
• it uses more information about the subproblem f than just the optimal solution (as in most primal block-coordinate schemes: ICM, alternating lines, etc.).

The performance of this method highly depends on the strategy of choosing majorants. This will also be the main question to address in the dual setting.
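The recurring subroutine here, exact minimization of a chain energy plus a modular (unary) term, is plain dynamic programming; a generic sketch with made-up costs (not the paper's GPU implementation):

```python
import numpy as np

def minimize_chain(unary, pairwise):
    """Exact minimizer of sum_i unary[i][x_i] + sum_i pairwise[i][x_i, x_{i+1}]
    over one chain by dynamic programming (Viterbi).
    unary: (n, K) array; pairwise: (n-1, K, K) array."""
    n, K = unary.shape
    best = unary[0].copy()                 # best cost of a path ending at node 0 with each label
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        cand = best[:, None] + pairwise[i - 1]     # (K, K): previous label x next label
        back[i] = np.argmin(cand, axis=0)
        best = cand[back[i], np.arange(K)] + unary[i]
    x = np.empty(n, dtype=int)
    x[-1] = int(np.argmin(best))
    for i in range(n - 1, 0, -1):                  # backtrack the optimal labeling
        x[i - 1] = back[i, x[i]]
    return x, float(best.min())

rng = np.random.default_rng(0)
unary, pairwise = rng.random((5, 4)), rng.random((4, 4, 4))
print(minimize_chain(unary, pairwise))
```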
Dual Decomposition. Minimization of (6) can be written as

min_{x^1 = x^2} f(x^1) + g(x^2).   (7)

Introducing a vector of Lagrange multipliers λ ∈ R^{L×V} for the constraint x^1 = x^2, we get the Lagrange dual problem

max_λ [ min_x (f(x) + ⟨λ, x⟩) + min_x (g(x) − ⟨λ, x⟩) ],   (8)

where the two inner minimization problems are denoted D1(λ) and D2(λ), respectively. The so-called slave problems D1(λ) and D2(λ) have the form of minimizing an energy function with a data cost modified by λ. The goal of the master problem (8) is to balance the data cost between the slave problems such that their solutions agree. The slave problems are minima of finitely many functions linear in λ; the objective of the master problem (8), D(λ) = D1(λ) + D2(λ), is thus a concave piecewise linear function, and (8) is a concave maximization problem. However, since x takes values in a discrete space, there is only weak duality: (7) ≥ (8). It is known that (8) can be written as a linear program (LP), which is as difficult in terms of computational complexity as a general LP [22].

Dual Minorize-Maximize. In the dual, which is a maximization problem, we speak of a minorize-maximize method. The setting is similar to the primal one. We can efficiently maximize D1, D2 but not D1 + D2. Suppose we have an initial dual point λ^0 and let x^0 ∈ argmin_x (f + λ^0)(x) be a solution to the slave subproblem D1, that is, D1(λ^0) = f(x^0) + λ^0(x^0).

Proposition 2.2. Let \underline{f} be a modular minorant of f exact in x^0 and such that \underline{f} + λ^0 ≥ D1(λ^0) (component-wise). Then the function \underline{D}_1(λ) = min_x (\underline{f} + λ)(x) is a minorant of D1(λ) exact at λ = λ^0.

Proof. Since \underline{f}(x) ≤ f(x) for all x, it follows that min_x (\underline{f} + λ)(x) ≤ min_x (f + λ)(x) for all λ, and therefore \underline{D}_1 is a minorant of D1. Next, on one hand we have \underline{D}_1(λ^0) ≤ D1(λ^0) and on the other, D1(λ^0) ≤ (\underline{f} + λ^0)(x) for all x, and thus D1(λ^0) ≤ \underline{D}_1(λ^0).

We have constructed a minorant of D1 which is itself a (simple) piecewise linear concave function. The maximization step of the minorize-maximize is to solve

max_λ (\underline{D}_1(λ) + D2(λ)).   (9)

Proposition 2.3. λ* = −\underline{f} is a solution to (9).

Proof. Substituting λ* into the objective (9) we obtain \underline{D}_1(λ*) + D2(λ*) = min_x (\underline{f} − \underline{f})(x) + D2(−\underline{f}) = min_x (\underline{f} + g)(x). This value is the maximum because \underline{D}_1(λ) + D2(λ) = min_x (\underline{f} + λ)(x) + min_x (g − λ)(x) ≤ min_x (\underline{f} + λ + g − λ)(x) = min_x (\underline{f} + g)(x).

Note that for the dual point λ = −\underline{f}, in order to construct a minorant of D2 (similarly to Proposition 2.2), we need to find a solution to the second slave problem,

x^1 ∈ argmin_x (g − λ)(x) = argmin_x (\underline{f} + g)(x).   (10)

We obtain Algorithm 2:

Algorithm 2: Dual MM
  Input: initial dual point \underline{g}^k
  Output: new dual point \underline{g}^{k+2}
  1: x^k ∈ argmin_x (f + \underline{g}^k)(x)   /* minimize */
  2: \underline{f}^{k+1} ⪯ f, \underline{f}^{k+1}(x^k) = f(x^k), \underline{f}^{k+1} + \underline{g}^k ≥ f(x^k) + \underline{g}^k(x^k)   /* minorize */
  3: x^{k+1} ∈ argmin_x (\underline{f}^{k+1} + g)(x)   /* minimize */
  4: \underline{g}^{k+2} ⪯ g, \underline{g}^{k+2}(x^{k+1}) = g(x^{k+1}), \underline{f}^{k+1} + \underline{g}^{k+2} ≥ \underline{f}^{k+1}(x^{k+1}) + g(x^{k+1})   /* minorize */

Algorithm 2 has the following properties:
• it builds the sequence of dual points given by λ^{2t} = \underline{g}^{2t}, λ^{2t+1} = −\underline{f}^{2t+1}, and the dual objective does not decrease at each step;
• the minimization subproblems and the minorants are decoupled (can be solved in parallel) for all horizontal (resp. vertical) chains;
• when provided with good minorants (see below), the algorithm has the same fixed points as TRW-S [15];
• updating only a single component λ_i for a pixel i is a monotonous step as well; therefore the algorithm is a parallel block-coordinate ascent.

Notice also that Dual MM and Primal MM are very similar, nearly up to replacing minorants with majorants. The sequence {E(x^k)}_k is monotonous in Algorithm 1 but not in Algorithm 2.

Good and Fast Minorants. The choice of the minorant in Dual MM is non-trivial, as there are many, which makes it sort of a secret ingredient. Figure 3 illustrates two of the possible choices. The naive minorant for a chain problem f + λ is constructed by calculating its min-marginals and dividing by the chain length to ensure that the simultaneous step is monotonous (cf. the tree block update algorithm of Sontag and Jaakkola [29, Fig. 1]). The uniform minorant is found through an optimization procedure that tries to build the tightest modular lower bound, by uniformly increasing all components that are not yet tight. The details are given in §A. In practice, we build fast minorants, which try to approximate the uniform one using fast message passing operations. Parallelization over the decoupled chains allowed us to achieve an implementation which, while having the same number of memory accesses as TRW-S (including messages / dual variables), saturates the GPU memory bandwidth of ∼230 GB/s (about 10 times faster than reported for the FPGA implementation [7] of TRW-S). This allows us to perform 5 iterations of Algorithm 2 for a 512×512 image with 64 labels at a rate of about 30 fps.

Figure 3: Lower bounds and best primal solutions by TRW-S and by Dual MM with a naive and a uniform minorant. The problem is a small crop from stereo of size 40×40, 16 labels, truncated linear regularization. On the x-axis, one iteration is a forward-backward pass of TRW-S vs. one iteration of Dual MM (equal number of updates per pixel). With a good choice of minorant, Dual MM can perform even better than the sequential baseline in terms of iterations. Parallelizing it can be expected to give a direct speedup.

2.4. Continuous Refinement

In this section we describe the continuous refinement method, which is based on variational energy minimization. The goal of this step is to refine the output of the optimization method described in § 2.3, which is discrete in label space.

To that end, it is important to minimize the same energy in both formulations. Considering the optimization problem in (1), we are seeking to minimize a non-convex, truncated norm together with a non-convex data term. For clarity, let us write down the problem again:

min_{u ∈ U} D(u) + R(Au).   (11)

Non-Convex Primal-Dual. Efficient algorithms exist to solve (11) in case both D(u) and R(Au) are convex (but possibly non-smooth), e.g. the primal-dual solver of Chambolle and Pock [6]. Kolmogorov et al. [16] solve (11) for a truncated total variation regularizer using a splitting into horizontal and vertical 1D problems and applying [6] to the Lagrangian function. Here we use a recently proposed extension of [6] by Valkonen [31]. He considers problems of the form min_x G(x) + F(A(x)), i.e. of the same structure as (11), where G and F are convex, G is differentiable and A(u) is a twice differentiable but possibly non-linear operator. In the primal-dual formulation, the problem is written as

min_x max_y G(x) + ⟨A(x), y⟩ − F*(y),   (12)

where * denotes the convex conjugate. Valkonen proposes the following modified primal-dual hybrid gradient method:

x^{k+1} = (I + τ ∂G)^{-1}(x^k − τ ∇A(x^k)^T y^k),   (13a)
y^{k+1} = (I + σ ∂F*)^{-1}(y^k + σ A(2x^{k+1} − x^k)).   (13b)
tree block update to solve (11) in case both D(u) and R(Au) are convex algorithm of Sontag and Jaakkola [29, Fig. 1]). The uni- (but possibly non-smooth), e.g. the primal-dual solver of form minorant is found through the optimization proce- Chambolle and Pock [6]. Kolmogorov et al. [16] solves dure that tries to build the tightest modular lower bound, (11) for a truncated total variation regularizer using a split-by increasing uniformly all components that are not yet ting into horizontal and vertical 1D problems and applying tight. The details are given in §A. In practice, we build [6] to the Lagrangian function. Here we will use a recently fast minorants, which try to approximate the uniform one proposed extension to [6] by Valkonen [31]. He considers using fast message passing operations. Parallelization of problems of the form minx G(x) + F(A(x)), i.e. of the decoupled chains allowed us to achieve an implementa- same structure as (11), where G and F are convex, G is tion which, while having the same number of memory ac- differentiable and A(u) is a twice differentiable but pos- cesses as TRW-S (including messages / dual variables), sibly non-linear operator. In the primal-dual formulation, saturates the GPU memory bandwidth, ∼ 230GB/s.2 This the problem is written as 2This is about 10 times faster than reported for FPGA implementa- min max G(x) + hA(x), yi − F∗(y), (12) tion [7] of TRW-S. x y where ∗ is the convex conjugate. Valkonen proposes the To compute the proximal map (I + σ∂F∗)−1(ˆ y) we following modified primal-dual hybrid gradient method: first need the convex conjugate of ωijrα,β(t). It is given by (ωijrα,β)∗(t∗) = xk+1 =(I + τ ∂G)−1(xk − τ ∇A(xk)T yk) (13a) (max(0, β|t∗| − ωijαβ) if α < |t∗| < ωij yk+1 =(I + σ∂F∗)−1(yk + σA(2xk+1 − xk)). (13b) . (21) ∞ else Reformulation In order to apply method [31], we will The proximal map for (ωijrα,β)∗ at t∗ ∈ R is given by reformulate the non-convex problem (11) to the form (12). ¯ t = clamp(±ωij, t0), where clamp(±ωij, ·) denotes a We start by formulating the regularizer R(Au) as a differ- clamping to the interval [−ωij, ωij] and ence of convex functions: R(Au) = R+(Au) − R−(Au), ( where R+ and R− are convex. The primal-dual formula- t∗ if |t∗| ≤ αωij t0 = tion of (11) then reads (22) max(αωij, |t∗|−βσ) sign(t∗) else. h min max(hAu, pi − R∗ Proximal map (I + σ∂ +(p)) (14) F∗)−1(ˆy) is calculated by applying u p expression (22) component-wise to ˆ y. The proximal map i + max(hAu, qi − R∗−(q)) + D(u) . (I + τ ∂G)−1 depends on the choice of the data term D(u) q and will thus be defined in § 3. Because minx −f(x) = − maxx f(x), (14) equals 3. Applications h min max(hAu, pi − R∗+(p))+ (15) 3.1. Stereo Reconstruction u p i + min(−hAu, qi + R∗ For the problem of estimating depth from two images, −(q)) + D(u) . q we look at a setup of two calibrated and synchronized cameras. We assume that the input images to our method Grouping terms we arrive at have been rectified according to the calibration parameters h i of the cameras. We aim to minimize the energy (1) where min max hAu, p−qi−R∗+(p)+R∗−(q)+D(u) . (16) u,q p u encodes the disparity in x-direction. The data term mea- sures the data fidelity between images I1 and I2, warped The problem now arises in minimizing the bilinear term by the disparity field u. As a data term we use the Census hAu, qi in (16) in both u and q. We thus move this term Transform [37] computed on a small local patch in each into the nonlinear operator A(x) and rewrite (16) as image. 
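Returning to the continuous refinement of § 2.4, the structure of iteration (13) and of the component-wise proximal map (22) can be sketched as follows. The operator A, its Jacobian transpose and the resolvent of G are kept abstract here; the function names, calling conventions and step-size handling are assumptions for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def nonlinear_pdhg(x0, y0, A, gradA_T, prox_G, prox_Fstar, tau, sigma, iters=100):
    """Schematic form of the modified primal-dual hybrid gradient iteration (13):
         x^{k+1} = (I + tau dG)^{-1}   ( x^k - tau * (grad A(x^k))^T y^k )     (13a)
         y^{k+1} = (I + sigma dF*)^{-1}( y^k + sigma * A(2 x^{k+1} - x^k) )    (13b)
       where A may be a non-linear (but twice differentiable) operator."""
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x_new = prox_G(x - tau * gradA_T(x, y), tau)
        y = prox_Fstar(y + sigma * A(2.0 * x_new - x), sigma)
        x = x_new
    return x, y

def prox_conj_truncated(t_star, omega, alpha, beta, sigma):
    """Component-wise proximal map of (omega_ij * r_{alpha,beta})^* following
       Eq. (22): shrink towards the kink at alpha*omega, then clamp the result
       to the interval [-omega, omega]."""
    a = np.abs(t_star)
    t0 = np.where(a <= alpha * omega,
                  t_star,
                  np.maximum(alpha * omega, a - beta * sigma) * np.sign(t_star))
    return np.clip(t0, -omega, omega)
```

In the stereo and flow applications, prox_Fstar is applied per dual component with the per-edge weights ω_ij as in (22), while prox_G is the data-term proximal map defined in § 3.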
The cost is given by the pixel-wise Hamming dis- tance on the transformed images.D(u) is non-convex in * + Au p the argument u which makes the optimization problem in min max , + R∗−(q) + D(u) u,q p,d=1 −hAu, qi d (1) intractable in general. | {z } |{z} | {z } | {z } G(x) We start by minimizing (1) using the discrete method x y A(x) (§2.3) in order to obtain an initial solution ˚ u. We approx- − R∗+(p) (17) imate the data term around the current point ˚ u by a piece- | {z } wise linear convex function ˜ D(u) = F∗(y) ( by introducing a dummy variable d = 1. s D(˚ u) + δ 1(u − ˚ u) if u ≤ ˚ u [˚ u−h,˚ u+h](u) + (23) Implementation Details The gradient of A needed by s2(u − ˚ u) otherwise iterates (13) is given by with s1 = D(˚ u+h)−D(˚ u) and s for a h 2 = D(˚ u)−D(˚ u+h) h A 0 small h. To ensure convexity, we set s if ∇A(x) = . 1 = s2 = s1+s2 (18) 2 −ATq −Au s2 < s1. The indicator function δ is added to ensure that the solution stays within ˚ u ± h where the approximation The regularization function r is represented as a difference is valid. We then apply the continuous method (§2.4). The of two convex functions (see Figure 2): proximal map ¯ u = (I + τ ∂G)−1(û) needed by the algo- rithm (13) for the approximated data term expresses as the r(t) = rε,δ(t) − r0,(C+δ−εδ)(t), (19) pointwise soft-thresholding where  τs   1,i if ûi > ˚ ui + τ s1,i (  α|t| if |t| ≤ β ¯ u ˚ u  r i = clamp i ± h, ûi − τs2,i if ûi < ˚ ui + τ s2,i α,β (t) = (20)   |t| − β(1 − α) else  0 otherwise is convex for α ≤ 1. Convex functions R+(Au) and In practice, the minimization has to be embedded in a R−(Au) are defined by decomposition (19) and (2). warping framework: after optimizing for n iterations, the data term is approximated anew at the current solution u. 3.2. Optical Flow (a) Input (b) Groundtruth The optical flow problem for two images I1, I2 is posed again as model (1). In contrast to stereo estimation, we now have u 2 i ∈ R encoding the flow vector. For the discrete optimization step (§2.3) the flow problem is decoupled into two independent stereo-like problems as dis- cussed in §2.2. (c) TV regularization (d) Proposed Method For the continuous refinement step, the main prob- lem is again the non-convexity of the data term. In- stead of a convex approximation with two linear slopes we build a quadratic approximation, now in 2D, follow- ing [34]. The approximated data term reads ˜ Di(ui) = δ[˚ ui−h,˚ ui+h](ui)+ 1 Figure 4: Influence of the robust regularizer in the continuous Di(˚ ui) + LT(u (u i i − ˚ ui) + refinement on stereo reconstruction quality. 2 i − ˚ ui)TQi(ui − ˚ ui), (24) (a) Refinement (b) No Refinement where L 2 2×2 i ∈ R and Qi ∈ R are finite difference ap- proximations of the gradient and the Hessian with step- size h. Convexity of (24) is ensured by retaining only positive-semidefinite part of Qi as in [34]. The proximal map ¯ u = (I + τ ∂G)−1(û) for data term (24) is given point-wise by ûk + τ (Q ¯ uk = clamp ˚ uk i i˚ ui − Li)k . Figure 5: Influence of continuous refinement on the reconstruc- i i ± h, (25) 1 + τ Lki tion quality of KinectFusion. Optimizing (1) is then performed as proposed in §2.4. For the purpose of this experiment we replace the Kinect 4. Experiments with a Point Grey Bumblebee2 stereo camera. KinectFu- sion can only handle relatively small camera movements 4.1. Stereo Reconstruction between images, so a high framerate is essential. 
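The Census-based data cost of § 3.1 can be prototyped along the following lines: a binary census descriptor over a small window for each pixel, and a per-disparity pixel-wise Hamming distance between the left image and the shifted right image. Window radius, border handling and the dense cost-volume layout are illustrative choices only; the real-time system evaluates this cost on the GPU and is not reproduced here.

```python
import numpy as np

def census_transform(img, radius=2):
    """Census transform: one boolean descriptor per pixel, with one bit per
       neighbour in a (2r+1)x(2r+1) window, set when the neighbour is darker
       than the centre pixel."""
    H, W = img.shape
    pad = np.pad(img, radius, mode='edge')
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            nb = pad[radius + dy: radius + dy + H, radius + dx: radius + dx + W]
            bits.append(nb < img)
    return np.stack(bits, axis=-1)          # H x W x (window size - 1), boolean

def census_cost_volume(left, right, num_disp):
    """Data cost: pixel-wise Hamming distance between census descriptors of the
       left image and the disparity-shifted right image, one slice per disparity."""
    cl, cr = census_transform(left), census_transform(right)
    H, W, _ = cl.shape
    cost = np.zeros((num_disp, H, W), dtype=np.float32)
    for d in range(num_disp):
        shifted = np.empty_like(cr)
        shifted[:, d:] = cr[:, :W - d]
        shifted[:, :d] = cr[:, :1]          # crude border handling
        cost[d] = (cl != shifted).sum(-1)   # Hamming distance
    return cost
```

The slice cost[d] plays the role of the unary data cost at disparity d, which the discrete stage of § 2.3 minimizes and which (23) approximates around the current solution for the continuous refinement.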
We set We evaluate our proposed real-time stereo method on the parameters to our method to achieve a compromise datasets where Ground-Truth data is available as well as between highest quality and a framerate of ≈ 4 − 5 fps: on images captured using a commercially available stereo camera resolution 640 × 480, 128 disparities, 4 iterations camera. of Dual MM, 5 warps and 40 iterations per warp of the continuous refinement. Influence of Continuous Refinement The first stage 4.1.1 Influence of Truncated Regularizer of our reconstruction method, Dual MM, already delivers We begin by comparing the proposed method to a sim- high quality disparity images that include details on fine plified version that does not use a truncated norm as reg- structures and depth discontinuities that are nicely aligned ularizer but a standard Total Variation. We show the ef- with edges in the image. In this experiment we want to fect of this change in Fig. 4, where one can observe much show the influence of the second stage, the continuous re-sharper edges, when using a robust norm in the regulariza- finement, on the reconstruction quality of KinectFusion. tion term. On the downside it is more sensitive to outliers, To that end we mount the camera on a tripod and collect which however can be removed in a post-processing step 300 depthmaps live from our full method and 300 frames like a two-side consistency check. with the continuous refinement switched off. By switch- ing off the camera tracking, the final reconstruction will 4.1.2 Live Dense Reconstruction show us the artifacts produced by the stereo method. Fig- ure 5 depicts the result of this comparison. One can easily To show the performance of our stereo matching method see that the output of the discrete method contains fine de- in a real live setting, we look at the task of creating a tails, but suffers from staircasing artifacts on slanted sur- live dense reconstruction from a set of depth images. To faces due to the integer solution. The increase in qual- that end, we are using a reimplementation of KinectFusion ity due to the refinement stage can be especially seen on proposed by Newcombe et al. [20] together with the out- far away objects, where a disparity step of 1 pixel is not put of our method. This method was originally designed enough to capture smooth surfaces. to be used with the RGBD output of a Microsoft Kinect Timing To show the influence of the individual steps and tracks the 6 DOF position of the camera in real-time. in our stereo method on runtime, we break down the total Cost Vol. Discrete Cont. Ref. Total Inputs 27 ms 73 ms 39 ms 139 ms Table 1: Runtime analysis of the individual components of our stereo matching method. Details regarding computing hardware Werlberger [33] Combined and parameters are in the text. In case of the full left-right check procedure the total computation time doubles. (a) Input (b) Reconstruction Figure 6: Qualitative result of reconstructing a desktop scene using KinectFusion3. time of ≈ 140 ms per frame in Table 1. Those timings have been achieved using a PC with 32 GB RAM with a NVidia 980GTX, running Linux. Qualitative Results To give an impression about the quality of the generated depthmaps and the speed of our method, we run our full algorithm and aim to reconstruct a desktop scene with a size of 1 × 1 × 1 meters and show some renderings in Fig. 6. To better visualize the quality Figure 7: Subjective comparison of variational approach [33] of the geometry, the model is rendered without texture3. 
(left) with our combined method (right). Top row show input 4.2. Optical Flow images, one from a pair. Both methods use the same data term. Parameters of both algorithms have been tuned by hand to de- In this section we show preliminary results of our liver good results. Note that for [33] it is often impossible to get algorithm applied to optical flow. A further improve- sharp motion boundaries as well as small scale details, despite a ment in quality can be expected by exploiting the coupled very strong data term (e.g. artifacts in left image, first row). scheme [26] in the discrete optimization, as discussed in § 2.2. As depicted in Figure 7, our method is able to de-tion is sufficiently localized, continuous representation in- liver reasonable results on a variety of input images. We creases the accuracy of the model as well as optimization deliberately chose scenes that contain large motion as well speed. In the continuous optimization, we experimented as small scale objects, to highlight the strengths of the with non-convex models and showed a reduction allowing discrete-continuous approach. For comparison, we use a to handle them with the help of a recent non-linear primal- state-of-the-art purely continuous variational optical flow dual method. This in turn allowed to speak of a global algorithm [33]. The runtime of our method is 2s for an model to be solved by a discrete-continuous optimization. image of size 640 × 480. Ideally, we would like to achieve a method, which, when given enough time, produces an accurate solution, 5. Conclusion and in the real time setting gives a robust result. We plan further to improve on the model. A vast literature on the The current results demonstrate that it is feasible to topic suggest that modeling occlusions and using planar solve dense image matching problems using global op- hypothesis can be very helpful. At the same time, we are timization methods with a good quality in real time. We interested in a tighter coupling of discrete and continuous have proposed a highly parallel discrete method, which optimization towards a globally optimal solution. even when executed sequentially, is competitive with the best sequential methods. As a dual method, we believe, Acknowledgements it has a potential to smoothly handle more complex mod- els in the dual decomposition framework and is in theory This work was supported by the research initiative applicable to general graphical models. When the solu- Mobile Vision with funding from the AIT and the Aus- 3 trian Federal Ministry of Science, Research and Economy We point the interested reader to a video that shows the reconstruc- tion pipeline in real-time: http://gpu4vision.icg.tugraz. HRSM programme (BGBl. II Nr. 292/2012). at/videos/cvww16.mp4 References [17] Lawler, E. (1966). Optimal cycles in doubly weighted di- rected linear graphs. In Intl Symp. Theory of Graphs. [1] Arashloo, S. R. and Kittler, J. (2014). Fast pose invariant face recognition using super coupled multiresolution Markov [18] Menze, M., Heipke, C., and Geiger, A. (2015). Discrete random fields on a GPU. Pattern Recognition Letters, 48. optimization for optical flow. In GCPR. [2] Banz, C., Hesselbarth, S., Flatt, H., Blume, H., and Pirsch, [19] Narasimhan, M. and Bilmes, J. (2005). A supermodular- P. (2010). Real-time stereo vision system using semi-global submodular procedure with applications to discriminative matching disparity estimation: Architecture and FPGA- structure learning. In Uncertainty in Artificial Intelligence. 
implementation. In ICSAMOS. [20] Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., [3] Bleyer, M. and Gelautz, M. (2008). Simple but effective tree Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., structures for dynamic programming-based stereo matching. and Fitzgibbon, A. (2011). Kinectfusion: Real-time dense In VISAPP. surface mapping and tracking. In ISMAR. [4] Brox, T., Bruhn, A., Papenberg, N., and Weickert, J. (2004). [21] Ochs, P., Chen, Y., Brox, T., and Pock, T. (2014). ip- High accuracy optical flow estimation based on a theory for iano: Inertial proximal algorithm for non-convex optimiza- warping. In ECCV. tion. SIAM JIS, 7(2). [5] Brox, T., Bruhn, A., and Weickert, J. (2006). Variational [22] Prusa, D. and Werner, T. (2015). Universality of the local motion segmentation with level sets. In ECCV, volume 3951. marginal polytope. PAMI, 37(4). [6] Chambolle, A. and Pock, T. (2011). A first-order primal- [23] Ranftl, R., Bredies, K., and Pock, T. (2014). Non-local dual algorithm for convex problems with applications to total generalized variation for optical flow estimation. In imaging. Journal of Mathematical Imaging and Vision, 40(1). ECCV. [7] Choi, J. and Rutenbar, R. A. (2012). Hardware implementa- [24] Ranftl, R., Gehrig, S., Pock, T., and Bischof, H. (2012). tion of MRF MAP inference on an FPGA platform. In Field Pushing the limits of stereo using variational stereo estima- Programmable Logic. tion. In Intelligent Vehicles Symposium. [8] Combettes, P. L. and Pesquet, J.-C. (2011). Proximal split- [25] Roth, S., Lempitsky, V., and Rother, C. (2009). Discrete- ting methods in signal processing. In Fixed-Point Algorithms continuous optimization for optical flow estimation. In Statis- for Inverse Problems in Science and Engineering. tical and Geometrical Approaches to Visual Motion Analysis, volume 5604. [9] Drory, A., Haubold, C., Avidan, S., and Hamprecht, F. (2014). Semi-global matching: A principled derivation in [26] Shekhovtsov, A., Kovtun, I., and Hlaváč, V. (2008). Effi- terms of message passing. In Pattern Recognition, volume cient MRF deformation model for non-rigid image matching. 8753. CVIU, 112. [10] Facciolo, G., de Franchis, C., and Meinhardt, E. (2015). [27] Shekhovtsov, A., Swoboda, P., and Savchynskyy, B. MGM: A significantly more global matching for stereovision. (2015). Maximum persistency via iterative relaxed inference In BMVC. with graphical models. In CVPR. [11] Hirschmuller, H. (2011). Semi-global matching- [28] Sinha, S. N., Scharstein, D., and Szeliski, R. (2014). motivation, developments and applications. Efficient high-resolution stereo matching using local plane sweeps. In CVPR. [12] Hosni, A., Rhemann, C., Bleyer, M., Rother, C., and Gelautz, M. (2013). Fast cost-volume filtering for visual cor- [29] Sontag, D. and Jaakkola, T. S. (2009). Tree block coordi- respondence and beyond. PAMI, 35(2). nate descent for MAP in graphical models. In AISTATS. [13] Hurkat, S., Choi, J., Nurvitadhi, E., Martınez, J. F., and [30] Taniai, T., Matsushita, Y., and Naemura, T. (2014). Graph Rutenbar, R. A. (2012). Fast hierarchical implementation of cut based continuous stereo matching using locally shared la- sequential tree-reweighted belief propagation for probabilis- bels. In CVPR. tic inference. In Field Programmable Logic. [31] Valkonen, T. (2014). A primal-dual hybrid gradient method [14] Kappes, J. H., Andres, B., Hamprecht, F. A., Schnörr, C., for nonlinear operators with applications to MRI. 
Inverse Nowozin, S., Batra, D., Kim, S., Kausler, B. X., Lellmann, J., Problems, 30(5). Komodakis, N., and Rother, C. (2013). A comparative study of modern inference techniques for discrete energy minimiza- [32] Wainwright, M., Jaakkola, T., and Willsky, A. (2005). tion problem. In CVPR. MAP estimation via agreement on (hyper)trees: Message- passing and linear-programming approaches. IEEE Trans- [15] Kolmogorov, V. (2006). Convergent tree-reweighted mes- actions on Information Theory, 51(11). sage passing for energy minimization. PAMI, 28(10). [33] Werlberger, M. (2012). Convex Approaches for High Per- [16] Kolmogorov, V., Pock, T., and Rolinek, M. (2015). Total formance Video Processing. PhD thesis, Institute for Com- variation on a tree. CoRR, abs/1502.07770. puter Graphics and Vision, Graz University of Technology, Graz, Austria. [34] Werlberger, M., Pock, T., and Bischof, H. (2010). Motion If i and j are two nodes in a chain f +λ then performing estimation with non-local total variation regularization. In the update of λi changes the min-marginal at j and vice- CVPR. versa. The updates must be implemented sequentially or otherwise one gets a non-monotonous behavior and the [35] Werner, T. (2007). A linear programming approach to max- sum problem: A review. PAMI, 29(7). method may fail to converge (see [15]). TRW-S gains its efficiency in that after the update (31), [36] Woodford, O., Torr, P., Reid, I., and Fitzgibbon, A. (2009). the min-marginal at a neighboring node can be recom- Global stereo reconstruction under second-order smoothness puted by a single step of dynamic programming. Let the priors. PAMI, 31(12). neighboring node be j = i + 1. The expression for the [37] Zabih, R. and Woodfill, J. (1994). Non-parametric local right min-marginal at j remains correct and the expres- transforms for computing visual correspondence. In ECCV, sion for left min-marginal is updated using its recurrent volume 801. expression ϕij(xj) := min ϕ Appendix A. Details of Dual MM i−1,i(xi) + fi(xi) + fij (xi, xj ), (32) xi In this section we specify details regarding computa- also known as message passing. Then min-marginal at j tion of minorants in Dual MM. The minorants are com- becomes available through (29). puted using message passing and we’ll also need the no- It is possible to perform update (31) in parallel by scal-tion of min-marginals. ing down the step size by the number of variables (or the length of the chain). This is equivalent to decom- A.1. Min-Marginals and Message Passing posing a chain f into n copies with costs f /n so that Definition A.1. For cost vector f its min-marginal at they contribute one for each node i with a min-marginal node i is the function mf : L → mf (x R given by i)/n. Confer to the parallel tree block update algo- rithm of Sontag and Jaakkola [29, Fig. 1]). However, the mf (xi) = min f (x). (26) gain from the palatalization does not pay off the decrease xV\i in the step size. Function mf (xi) is a min projection of f (x) onto xi A.2. Slacks only. Given the choice of xi, it returns the cost of the best labeling in f that passes through x In the following we will also use the term slack. i. For a chain problem it can be computed using dynamic programming. Let us Shortly, it is explained as follows. The dual problem (8) assume that the nodes V are enumerated in the order of can be written as a linear program, see e.g., [35]. Dual in-the chain and E = {(i, i + 1) | i = 1 . . . |V| − 1}. 
We then equality constraints in that program can satisfied as equal- need to compute: left min-marginals: ϕ ities, in which case they are tight, or they can be satisfied i−1,i(xi) := as strict inequalities in which case there is a slack. Equiv- X X min f alent reparametrization of the problem (change of the dual i0 (xi0 ) + fi0j0(xi0, xj0); (27) x1,...i−1 i0i i0j0∈E | i0≥i maximum slack that can be concentrated in a label-node equals the corresponding min-marginal. These values for all ij ∈ E, xi, xj ∈ L can be computed dynamically (recursively). After that, the min-marginal A.3. Good Minoratns mf (xi) expresses as Definition A.2. A modular minorant λ of f is maximal mf (x if there is no other modular minorant λ0 ≥ λ such that i) = fi(xi) + ϕi−1,i(xi) + ϕi+1,i(xi). (29) λ0(x) > λ(x) for some x. TRW-S method [15] can be derived as selecting one Lemma A.3. For a maximal minorant λ of f all min- node i at a time and maximizing (8) with respect to λi marginals of f − λ are identically zero. only. For the two slave problems in (8) TRW-S needs to compute min-marginals mf+λ(xi) and mg−λ(xi). A Proof. Since λ is a minorant, min-marginals mi(xi) = (non-unique) optimal choice for λi would be to ensure that minx [f (x) − λ(x)] are non-negative. Assume for con- V\i tradiction that ∃i, ∃xi such that mi(xi) > 0. Clearly, mf+λ(xi) = mg−λ(xi) ∀xi ∈ L (30) λ0(x) := λ(x) + mi(xi) is also a minorant and λ0 > λ. by setting Even using maximal minorants, the Algorithm 2 can λi := λi + (mg−λ(xi) − mf+λ(xi))/2. (31) get stuck in fixed points which do not satisfy weak tree agreement [15], e.g. suboptimal even in the class of mes-Algorithm 3: Maximal Uniform Minorant sage passing algorithms. Consider the following example Input: Chain subproblem f ; of a minorant leading to a poor fixed point. Output: Minorant λ; Example A.4. Consider a model in Figure 8 with two la- 1 λ := 0; bels and strong Ising interactions ensuring that the optimal 2 while true labeling is uniform. If we select minorants that just takes 3 Compute min-marginals m of f − λ; the unary term, without redistributing it along horizontal 4 if m = 0 then return λ; or vertical chains, the lower bound will not increase. For 5 Let O := [ m = 0]], the support set of optimal example, for the horizontal chain (v solutions of m − λ; 1, v2), the minorant (1, 0) (displayed values correspond to λ 6 Find max{ε | (∀x) εh1 − O, xi ≤ (f − λ)(x)}; v (1) − λv (2)). This minorant is maximal, but it does not propagate the 7 Let λ := λ + ε(1 − O); information available in v1 to v2 for the exchange with the vertical chain (v2, v4). The optimization problem in Line 6 can be solved us- ing the minimum ratio cycle algorithm of Lawler [17]. We +1 search for a path with a minimum ratio of the cost given by (f − λ)(x) to the number of selected labels with non- zero min-marginals given by h1 − O, xi. This algorithm is rather efficient, however Algorithm 3 it is still too costly and not well-suited for a parallel implementation. We will +0.5 not use this method in practice directly, rather it estab- lishes a sound baseline that can be compared to. The resulting minorant λ is maximal and uniform in Figure 8: Example minorize-minimize stucks with a minorant the following sense. that does not redistribute slack. Lemma A.6. Let m be the vector of min-marginals of f . The uniform minorant λ found by Algorithm 3 satisfies A.3.1 Uniform Minorants λ ≥ m/n, (34) Dual algorithms, by dividing the slacks between subprob- where n is the length of the longest chain in f . 
lems ensure that there is always a non-zero fraction of it (depending on the choice of weights in the scheme) prop- Proof. This is ensured by Algorithm 3 as in each step the agated along each chain. We need a minorant, which will increment ε results from dividing the min-marginal by expose in every variable what is the preferable solution h1 − O, xi which is at most the length of the chain. for the subproblem. We can even try to treat all variables In fact, when the chain is strongly correlated, the mi- uniformly. The practical strategy proposed below is moti- norant will approach m/n and we cannot do better than vated by the following. that. However, if the correlation is not as strong the mi- Proposition A.5. Let f ∗ = min norant becomes tighter, and in the limit of zero pairwise x f (x) and let Ou be the support set of all optimal solutions x∗ interactions there holds λ = m. In a sense the minorant u in u ∈ V . Con- sider the minorant λ given by λ computes “decorrellated” min-marginals. u(xu) = ε(1 − Ou) and maximizing ε: The next example illustrates uniform minorants and steps of the algorithm. max{ε | (∀x) εh1 − O, xi ≤ f(x)}. (33) Example A.7. Consider a chain model with the following The above minorant assigns cost ε to all labels but data unary cost entries (3 labels, 6 nodes): those in the set of optimal solutions. If the optimal so- 0 0 1 0 0 8 lution x∗ is unique, it takes the form λ = ε(1 − x∗). 9 7 0 3 2 8 This minorant corresponds to the direction of the subgra- 7 3 6 9 1 0 dient method and ε determines the step size which ensures The regularization is a Potts model with cost f monotonicity. However it is not maximal. In f − λ there uv (xu, xv ) = 1[[xu 6= xv] . Min-marginals of the still remains a lot of slack that can be useful when ex- problem and iteration of Algorithm 3 are ilustrated in changing to the other problem. It is possible to consider Figure 9. At the first iteration the constructed minorant is f − λ 0 0 0 0 0 1 again. If we have solved (33), it will necessarily 1 1 1 1 1 1 have a larger set of optimal solutions. We can search for 1 1 1 1 1 0 a maximal ε1 that can be subtracted from all non-optimal And the final minorant is: label-nodes in f − λ and so on. The algorithm is specified as Algorithm 3. 0 0 0 0 0 7 Algorithm 4: Iterative Minorant 8 7 1 2 2 7 Input: Chain subproblem f ; 6 4 6 7 1 0 Output: Minorant λ; The minorant follows min-marginals (first plot in Fig- 1 λ := 0; ure 9), because the interaction strength is relatively weak 2 for s = 1 . . . max pass do and min-marginals are nearly independent. If we in- 3 for i = 1 . . . |V | do crease interaction strength to 5, we find the following min- 4 Compute min-marginal mi of f − λ at i marginals and minorant, respectively: dynamically, equations (32) and (29); 0 0 0 0 0 3 5 λi += γsmi; 14 15 8 8 7 8 6 Reverse the chain; 12 13 15 10 1 0 0 0 0 0 0 3 5.5 5.5 3 3 3 3 efficiently alternates between the forward and the back- ward passes. For the last pass coefficient γ 4.75 4.75 4.75 4.75 1 0 s is set to 1 to ensure that the output minorant is maximal. Figure 10 It is seen that in this case min-marginals are correlated and illustrates that this idea can perform well in practice. only a fraction can be drained in parallel. The uniform approach automatically divides the cost equally between strongly correlated labels. 
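As a reference point, the min-marginal computation of Eqs. (27)-(29)/(32) and the Iterative Minorant of Algorithm 4 can be sketched for a single chain as follows. Both helpers assume non-negative costs and a dense L×L pairwise table; the backward messages are recomputed once per pass while the forward message is maintained on the already-updated costs, so each node sees the true min-marginal of the current f − λ. Names and array layout are illustrative assumptions, and the batched GPU organisation used in the paper is not reproduced.

```python
import numpy as np

def chain_min_marginals(unary, pair):
    """Min-marginals m_i(x_i) of a chain energy
         f(x) = sum_i unary[i, x_i] + sum_i pair[x_i, x_{i+1}]
       via left/right messages: m_i = unary_i + phi_left_i + phi_right_i (Eq. (29))."""
    n, L = unary.shape
    phi_l = np.zeros((n, L))
    phi_r = np.zeros((n, L))
    for i in range(1, n):            # forward pass, one DP step per node (Eq. (32))
        phi_l[i] = (phi_l[i - 1][:, None] + unary[i - 1][:, None] + pair).min(axis=0)
    for i in range(n - 2, -1, -1):   # backward pass
        phi_r[i] = (pair + unary[i + 1][None, :] + phi_r[i + 1][None, :]).min(axis=1)
    return unary + phi_l + phi_r

def iterative_minorant(unary, pair, max_pass=3, gamma=0.25):
    """Iterative Minorant (Algorithm 4) for one chain with non-negative costs.
       Each sweep drains a fraction gamma of the current min-marginal of f - lambda
       into the modular minorant lambda; the last pass uses gamma = 1 so that the
       returned minorant is maximal (all remaining min-marginals are zero)."""
    n, L = unary.shape
    lam = np.zeros((n, L))
    u = unary.astype(float).copy()   # node costs of the remainder f - lambda
    P = pair
    for s in range(max_pass):
        g = 1.0 if s == max_pass - 1 else gamma
        phi_r = np.zeros((n, L))     # right messages for the current f - lambda
        for i in range(n - 2, -1, -1):
            phi_r[i] = (P + u[i + 1][None, :] + phi_r[i + 1][None, :]).min(axis=1)
        phi_l = np.zeros(L)          # left message, maintained on updated costs
        for i in range(n):
            m = u[i] + phi_l + phi_r[i]     # min-marginal at node i (Eq. (29))
            lam[i] += g * m
            u[i] -= g * m                   # drain it from the remainder
            if i + 1 < n:                   # propagate the left message (Eq. (32))
                phi_l = (phi_l[:, None] + u[i][:, None] + P).min(axis=0)
        u, lam, P = u[::-1].copy(), lam[::-1].copy(), P.T   # reverse the chain
    return lam[::-1].copy() if max_pass % 2 else lam
```

This is only a readable single-chain reference; the cache-friendly Batch-Iter and hierarchical variants discussed in this appendix reorganise the same message-passing operations for parallel hardware.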
TRW−S 8200 TRW−S primal DMM−uniform Primal DMM−Iterative-s−3−frac−0.25 Primal 8000 0 0 0 0 0 7 DMM−Batch-Iter−3−frac−0.25 Primal (a) 10 8 1 4 3 8 7800 8 5 7 10 1 0 7600 0 0 0 0 0 6 7400 (b) 9 6 0 1 2 7 7200 2 3 4 5 6 7 8 9 10 7 3 6 7 0 0 Figure 10: 0 0 0 0 0 5 Same setting as in Figure 3. The new plots show that Iterative minorants are not as good as uniform but still perform (c) 8 5 0 0 0 5 very well. Parameter max pass = 3 and γs = 0.25 were used. 6 2 5 6 0 0 The Batch Iterative method (Batch-Iter) runs forward-backward iterations in a smaller range, which is more cache-efficient and Figure 9: (a) Min-marginals (normalized by subtracting the is also performing relatively well in this example. value of the minimum) at vertices and arrows allowing to back- track the optimal solution passing through a given vertex. (b), (c) min-marginals of f − λ after one (resp. two) iterations of Al- A.3.3 Hierarchical Minorants gorithm 3 (ε1 = 1 and ε2 = 1). With each iteration the number of vertices having zero min-marginal strictly increases. The idea of hierarchical minorants is as follows. Let f be a one horizontal chain. We can break it into two sub- chains of approximately the same size, sharing a variable A basic performance test of Dual MM with uniform xi in the middle. By introducing a Lagrange multiplier minorants versus TRW-S is shown in Figure 3. It demon- over this variable, we can decouple the two chains. The strates that the Dual MM can be faster, when provided value of the Lagrange multiplier can be chosen such that good minorants. The only problem is that determining the both subchains have exactly the same min-marginals in uniform minorant involves repeatedly solving minimum xi. This makes the split uniform in a certain sense. Pro- ratio path problems, plus there is a numerical instability ceeding so we increase the amount of parallelism and hi- in determining the support set of optimal solutions O. erarchically break the chain down to two-variable pieces, for which the minorant is computed more or less straight- A.3.2 Iterative Minorants forwardly. This is the method used to obtain all visual experiments in the paper. Its more detailed benchmarking A simpler way to construct a maximal minorant would is left for future work. We detail now the simplest case be to iteratively subtract from f a portion of its min- when the chain has length two, i.e., the energy is given by marginals and accumulate them in the minorant, until all f min-marginals of the reminder become zero. Algorithm 4 1(x1) + f12(x1, x2) + f2(x2). The procedure to compute the minorant is as follows: implements this idea. The portion of min-marginals • Compute mf drained from the reminder f − λ to the minorant λ in each 1 (x1) and let λ1 := mf 1 (x1)/2. I.e., we subtract a half of the min-marginal in the first node. iteration is controlled by γs ∈ (0, 1]. Reversing the chain • Recompute the new min-marginal at node 2: update Algorithm 5: Handshake [>>>>>>>>>>>>>>><<<<<<<<<<<<<<<] [.......<<<<<<<][>>>>>>>.......] Input: Energy terms fi, fj, fij, messages ϕi−1,i(xi) [...<<<][>>>...][...<<<][>>>...] and ϕj,j+1(xj) ; [.<][>.][.<][>.][.<][>.][.<][>.] Output: Messages for decorrellated chains: ϕji(xi) [][][][][][][][][][][][][][][][] and ϕij(xj) ; /* Message from j to i */ Figure 11: Messages passed in the construction of the hier- 1 ϕji(xi) := Msgji(fj + ϕj,j+1); archical minorant for a chain of length 32. From top to bot- /* Total min-marginal at i */ tom: level of hierarchical processing. 
Symbols > and < denote 2 m message passing in the respective direction. Brackets [] mark i(xi) := ϕi−1,i(xi) + fi(xi) + ϕji(xi); / the limits of the decorrellated sub-chains at the current level. * Share a half to the right */ Dots denote places where the previously computed messages in 3 ϕij (xj ) := Msgij (mi/2 − ϕji); the needed direction remain valid and need not be recomputed. /* Bounce back what cannot be shared */ Places where the two opposite messages meet correspond to the 4 ϕji(xi) := Msgji(−ϕij ); execution of the Handshake procedure. The lowest level con-5 Procedure Msgij (a) sists of 16 decorrellated chains of length 2 each. Input: Unary cost a ∈ K R ; Output: Message from i to j; A.4. Iteration Complexity 6 return ϕ(xj) := minx a(x i ∈L i) + fij (xi, xj ); The bottleneck in a fast implementation of dual algo- rithms are the memory access operations. This is simply the message ϕ12(x2) := Msg12(f1 − λ1); Reassem- because there is a big cost data volume that needs to be ble mf−λ 2 (x2) = ϕ12(x2) + f2(x2). scanned in each iteration plus messages have to be red and • Take this whole remaining min-marginal to the mi- written in TRW-S as well as in out Algorithm 2 (dual vari-norant: let λ2 := mf−λ 2 (x2). ables λ). We therefore will assess complexity in terms of • Recompute the new min-marginal at node 1: update memory access operations and ignore the slightly higher the message ϕ21(x1) := Msg21(f2 −λ2); It still may arithmetic complexity of our minorants. be non-zero. For example, if the pairwise term of f For TRW-S the accesses per pixel are: is zero we recover the remaining half of the initial • read all incoming messages (4 access); min-marginal at node 1. Let λ1 += mf−λ 1 (x1). • read data term (1 access); Importantly, the computation has been expressed in terms • write out messages in the pass direction (2 accesses). of message passing, and therefore can be implemented The cache can potentially amortize writing messages and as efficiently. The procedure fro the two-node case is reading them back in the next scan line, in which case straightforwardly generalized to longer chains. Let ij be the complexity could be counted as 5 accesses per pixel. an edge in the middle of the chain. We compute left min- However, currently only CPU cache is big enough for this, marginal at i, right min-marginal at j and then apply the while multiprocessors in GPU have relatively small cache Handshake procedure over the edge ij, defined in Algo- divided between many parallel threads. rithm 5. The procedure divides the slack between nodes i For the iterative minorant we have 3 forward-backward and j similarly to how it is described above for the pair. passes reading the data cost, the reverse message and writ- The result of this redistribution is encoded directly in the ing the forward message (3*2*3 accesses), the last itera- messages. The two subchains 1, . . . i and j, . . . |V| are tion writes λ and not the message. Some saving is pos- “decorrellated” by the Handshake and will not talk to sible with a small cache set at a cost of more computa- each other further during the construction of the minorant. tions. Computing the hierarchical minorant as described The left min-marginal for subchain j, . . . |V| at node j + 1 in Figure 11 for a chain of length 2048, assuming that is computed using update (32) and so on until the mid- chunks of size 8 already fit in the fast memory (registers dle of the subchain where a new Handshake is invoked. + shared memory) has the following complexity. 
Read- The minorant is computed at the lowest level of hierarchy ing data costs and writing messages until length 8 totals when the length of the subchain becomes two. The struc- to 2 + log2(2048/8)/2 = 6 accesses. Reading messages ture of the processing is illustrated in Figure 11. It is seen is only required at Handshake points and needs to be that each level after the top one requires to send messages counted only until reaching length 8. Writing λ adds one only for a half of nodes in total. Moreover, there is only more access. These estimates are summarized in Table 2. a logarithmic number of level. It turns out that this pro- cedure is not much more computationally costly than just TRW-S Iterative Naive BCD Hierarchical computing min-marginals. For example, to restore left 7(5) 18(8) 5(4) 7 min-marginal for the subchain j, . . . |V |, in node i + 1 we We conjecture that while iterative minorants may trans- Table 2: Memory accesses per pixel in TRW-S and Dual MM fer only a geometric fraction of min-marginals in some with variants of minorants. Naive BCD here means just comput- ing min-marginals. cases, the hierarchical minorant is only by a constant fac- tor inferior to the uniform one. 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 Touching without vision: terrain perception in sensory deprived environments Vojtěch Šalansk´y∗, Vladim´ır Kubelka∗†, Karel Zimmermann∗, Michal Reinstein∗, Tomáš Svoboda∗† Abstract. In this paper we demonstrate a combined hardware and software solution that enhances sensor suite and perception capabilities of a mobile robot intended for real Urban Search & Rescue missions. A common fail-case, when exploring unknown envi- ronment of a disaster site, is the outage or deteriora- tion of exteroceptive sensory measurements that the Figure 1. From left: UGV robot approaches smoke area; robot heavily relies on—especially for localization Example of visual information that the operator sees and navigation purposes. Deprivation of visual and inside a cloud of smoke: a crop out from the omni- laser modalities caused by dense smoke motivated directional camera (middle) and output of the laser range- us to develop a novel solution comprised of force finder (rainbow-colored point cloud in the right half of the sensor arrays embedded into tracks of our platform. image). Laser beams are randomly reflected by smoke particles. The resulting 3D point cloud is just noise close Furthermore, we also exploit a robotic arm for ac- to the robot. tive perception in cases when the prediction based on force sensors is too uncertain. Beside the integration of hardware, we also propose a framework exploiting project1, which develops novel software and tech- Gaussian processes followed by Gibb’s sampling to nology for human-robot teams in disaster response process raw sensor measurements and provide prob- efforts [1], we have to deal with such problems. abilistic interpretation of the underlying terrain pro- One of the crucial fail-cases is the presence of file. In the final, the profile is perceived by propri- dense smoke that blocks camera view and spoils laser oceptive means only and successfully substitutes for measurements, creating false obstacles in front of the the lack of exteroceptive measurements in the close robot (Fig. 1). Without exteroceptive measurements, vicinity of the robot, when traversing unknown and classical approaches to robot SLAM cannot be used. unseen obstacles. 
We evaluated our solution on real Localization can only be in the dead-reckoning sense world terrains. and the operator of the robot has to rely solely on the maps created up to the point of the sensor out- age. In an industrial environment consisting of many 1. Introduction hazardous areas, driving blind can lead to damage or loss of the robot. Advances in robotic technology allow mobile Therefore, we propose a combined hardware and robots to be deployed in gradually more and more software solution to predict the profile of terrain un- challenging environments. However, real-world con- derneath and in front of the tracked robot. The al- ditions often complicate or even prohibit adoption of gorithm exploits a prototype of a force sensor array classical approaches to localization, mapping, nav- installed inside a track of the robot, a robotic arm igation, or teleoperation. When rescuers operate attached to the robot, proprioceptive measurements a UGV during joint experiments in the TRADR from joints and an inertial measurement unit (IMU), and information learned from a dataset of traversed ∗Authors are with the Faculty of Electrical Engineer- terrains. The prototype of the force sensor (Fig. 2, 3) ing, Czech Technical University in Prague, {salanvoj, kubelvla, reinstein.michal, zimmerk, is suitable for tracked robots and is installed between svobodat}@fel.cvut.cz rubber track and its support, allowing it to serve as †Authors are with the Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague 1http://www.tradr-project.eu a tactile sensor. The arm is used to measure height of tion; they mimic facial whiskers of animals and us- terrain outside the reach of the force sensor as contact ing them as a tactile sensor is a promising way to between the arm end-effector and the terrain. The explore areas, which are prohibitive to standard ex- height of terrain that cannot be measured directly is teroceptive sensors. Work of [5] presents a way to estimated by sampling from a joint probability dis- use array of actively actuated whiskers to discrimi- tribution of terrain heights, conditioned by propri- nate various surface textures. In [6], similar sensor oceptive measurements (geometric configuration of is used for a SLAM task. Two sensing modalities— the robot, torques in joints and attitude of the robot) the whisker sensor array and the wheel odometry are and learned from a training dataset consisting of real- used to build a 2D occupancy map. Robot localiza- world examples of traversed terrains. tion is then performed using particle filter with par- The estimates of terrain profile are used as a par- ticles representing one second long ”whisk periods”. tial substitute for missing laser range-finder data that During these periods, the sensor actively builds lo- would reveal obstacles or serve as an input for adap- cal model of the obstacle it touches. Unfortunately, tive traversability algorithm. design of our platform does not allow using such Our contribution is twofold: we designed a new whiskers due to rotating laser range-finder. force sensor suitable for tracked robots as well as an Relation between shape of terrain that we are in- algorithm that uses proprioceptive and tactile mea- terested in and configuration of the flippers is investi- surements to estimate terrain shape in conditions that gated in [7]. The authors exploit the knowledge about prohibit usage of cameras and laser range-finders. 
robot configuration and torques in joints to define We extended this solution with robotic arm to deal a set of rules for climbing and descending obstacles with special cases when the predictions have too high not observed by exteroceptive sensors. We investi- uncertainty. gated this problem in [8] by introducing the adaptive The rest of the paper is structured as follows: Sec- traversability algorithm based on machine learning. tion II concludes briefly the related work, Section III We collected features from both proprioceptive and describes the hardware solution and Section IV the exteroceptive sensors to learn a policy that ensures actual software. In Section V we present both qual- safe traversal over obstacles by adjusting robot mor- itative and quantitative experimental evaluation and phology. An idea of adding pressure sensors mimick- we conclude our achievements in Section VI. ing properties of human skin to feet of bipedal robots is presented in [9, 10]. These sensors can be used 2. Related work for measuring force distribution between the robotic The problem of terrain characterization primarily foot and ground, or for terrain type classification. In using proprioceptive sensors, but also by sonar/infra- tracked robots, caterpillar tracks can be further used red range-finders and by a microphone is discussed in to explore terrain, authors of [11] propose a novel [2]. The authors exploit neural networks trained for distributed sensor that detects deflection of the track each sensor and demonstrate that they are able to rec- in contact points with terrain. Their sensor is espe- ognize different categories: gravel, grass, sand, pave- cially suitable for chained tracks with rubber shoes. ment and dirt surface. More recent results come from The prototype we present is more suitable for thin legged robotics, in [3], Pitman-Yor process mixture rubber tracks. of Gaussians is used to learn terrain types both in On contrary to the approaches exploiting only supervised and unsupervised manner based on force simple contact sensors, we extend our sensory suite and torque features sensed in legs. In our work, we with a robotic arm for further active perception for focus more on the actual terrain profile prediction, cases if necessary. Related to the active perception, necessary for successful traversal. relevant ideas and techniques come from the field of Lack of sufficient visual information related to haptics. The work of [12] proposes to create mod- danger of collision with obstacles is addressed in els of objects in order to be able to grasp them. The [4]: decision whether it is safe to navigate through idea is to complement visual measurements by tac- vegetation is based on wide-band radar measure- tile ones by strategically touching the object in ar- ments since it is impossible to detect solid obstacle eas with high shape uncertainty. For this purpose behind vegetation from laser range-finder or visual they use Gaussian processes (GP, [13]) to express the data. Artificial whiskers offer an alternative solu- shape of the object. We take a similar approach: we choose parts of terrain to be explored by the robotic arm based on uncertainty of the estimate resulting from the sampling process (Sec. 4.3). Probabilistic approach to express uncertainty in touched points is also described in [14], where only tactile sensors of a robotic hand are used to reconstruct the shape of an unknown object. Active tactile terrain exploration Figure 2. 
Prototype of the flipper force sensor: array of six can also lead to terrain type classification, as works sensing elements (FSR 402) is covered by a stripe of steel, of [15, 16] demonstrate. forming a thin sensor that fits between the rubber track and the plastic track support. The stripe of steel protects 3. Sensors the sensors from the moving rubber track and distributes measured force amongst them. 3.1. Sensors of the TRADR UGV The TRADR UGV platform is equipped with both proprioceptive and exteroceptive sensors. Inertial measurement unit Xsens MTi-G (IMU) provides ba- sic attitude measurements; all joints have angle en- coders installed to reveal current configuration of the robot like flipper angles, and velocity of the caterpil- lar tracks. Electric currents to all motors are mea- sured and translated into torque values. Visual in- Force sensing elements Analog-to-digital converter formation about the environment is acquired by an +5V omni-directional Point Grey Ladybug 3 camera ac- FSR 402 companied by a rotating SICK LMS-151 laser range I2C ADC Pi Plus Raspberry Pi 2B finder that provides depth information. The laser R1 10k range-finder is used to collect data that are processed to serve as ground truth for the terrain reconstruction purposes. ... 3.2. Prototype of force sensor Figure 3. The sensor mounted to the plastic track support To obtain well-defined contact points with the (top). The sensing elements are passive sensors that ex- ground, we decided to take advantage of the flippers hibit decrease in resistance with applied force. For each sensing element, we use a reference resistor to form a volt- that can reach in front of the robot and are designed to age divider; we obtain voltage inversely proportional to operate on dirty surfaces or sharp edges. The original the resistance of the FSR 402 elements (bottom). mechatronics of the robot allows to measure torque in flipper servos and thus detect physical contact be- tween flippers and the environment. To be able to ing force; the force sensitivity range is 0.1 − 10 N. locate the contact point on the flipper exactly, we de- To measure the resistance, we connect them in series signed a thin force sensor between the rubber track with a fixed reference resistor forming a voltage di- and its plastic support (see Fig. 2, 3). Since it is a first vider. We apply 5 V to this divider and measure volt-prototype, we use it only in one flipper and consider age on the reference resistor. We use an analog-to- only symmetrical obstacles or steps. The sensor con- digital converter expansion board for the Raspberry struction is a sandwich of two thin stripes of steel Pi computer to read the six voltages. We calibrate with FSR 402 sensing elements between them which the voltage values for initial bias caused by the sand- allows the rubber track to slide over it while mea- wich construction. suring forces applied onto the track. There are six Figure 4 shows three examples of the sensor read- force sensing elements; the protecting sheet of steel ings. The first case consists of a flipper touching flat distributes the force among them, the sensor is thus floor. Although one would expect to see more or less sensitive along its whole length. 
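A minimal sketch of the readout chain implied by Figure 3 follows: each divider voltage (measured across the reference resistor) is converted to the FSR 402 resistance and then to a rough force estimate. The 5 V supply and the 10 kOhm reference resistor come from the figure; the read_channel function and the force calibration constants are placeholders, since the FSR elements require per-sensor calibration over their 0.1-10 N range.

```python
V_SUPPLY = 5.0      # divider supply voltage (Fig. 3)
R_REF = 10_000.0    # reference resistor R1 = 10 kOhm (Fig. 3)

def fsr_resistance(v_out):
    """Voltage measured across the reference resistor -> FSR 402 resistance,
       from v_out = V_SUPPLY * R_REF / (R_REF + R_fsr)."""
    v_out = max(min(v_out, V_SUPPLY - 1e-3), 1e-3)   # guard against division by zero
    return R_REF * (V_SUPPLY - v_out) / v_out

def fsr_force(resistance, k=60.0, exponent=1.0):
    """Very rough force estimate in newtons: FSR conductance is roughly
       proportional to the applied force over part of the sensing range; k and
       exponent are placeholder calibration constants, not datasheet values."""
    return k * (1.0 / resistance) ** exponent

def read_track_forces(read_channel, bias, n_elements=6):
    """Read the six sensing elements and return bias-corrected force estimates.
       `read_channel(i)` stands in for the ADC read on the Raspberry Pi expansion
       board; `bias` holds the per-element voltage offsets caused by the
       sandwich construction."""
    forces = []
    for i in range(n_elements):
        v = read_channel(i) - bias[i]
        forces.append(fsr_force(fsr_resistance(max(v, 0.0))))
    return forces
```

Thresholding these per-element force values (as in the examples of Figure 4) is what later decides whether a contact point is reliable enough to fix the corresponding terrain bin from kinematics alone.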
equal distribution of the contact force along the flip- The FSR 402 sensing elements are passive sen- per track, the torque generated by the flipper actually sors that exhibit decrease in resistance with increas- lifts the robot slightly and thus, most of the force con- 5 allows the robot to measure the height of terrain in 10N) a chosen point by gradually lowering the arm until ≈ 4 upsurge of actuator currents indicates contact with 3 ground (there are currently no touch sensors) [17]. Accuracy of the measurement is 3 cm (standard de- 2 viation). However, the process of unfolding the arm, 1 planning and execution of the desired motion and fi- nally folding back to home position can easily take Sensor element output (5 units 0 1 2 3 4 5 6 45 s. Therefore, it is practical to use the arm for this Sensor element number purpose only in situations when the gain from the 5 10N) additional information overweights the cost of time ≈ 4 spent to get it. In Section 4.4, we describe criterion for decision to use the arm. 3 2 4. Terrain shape reconstruction 1 Sensor element output (5 units 0 When robot is teleoperated operator’s awareness is 1 2 3 4 5 6 Sensor element number based on camera images and the 3D laser map. In the 5 presence of smoke, both of these modalities are use- 10N) less, see output of the operator console in the pres- ≈ 4 ence of smoke shown in Figure 1. We propose active 3 tactile exploration mode (ATEM), in which flippers and robotic arm autonomously explores the terrain 2 shape in close vicinity of the robot. Estimated ter- 1 rain shape and expected reconstruction accuracy are Sensor element output (5 units eventually displayed to the operator. 0 1 2 3 4 5 6 Sensor element number If ATEM is requested by the operator, robot first Figure 4. Examples of the force sensor readings. The plots adjusts flippers to press against the terrain and cap- on the left side show raw readings of each sensing ele- ture proprioceptive measurements. Then the initial ment, only corrected for bias. The photos on the right side probabilistic reconstruction of the underlying terrain document the moments of the readings acquisition. See shape is estimated from the captured data. If the re- section 3 for discussion over the three example cases. construction is ambiguous, the robotic arm explores the terrain height in the most inaccurate place. Even- centrates at its tip (element n. 6). Compare this case tually, the probabilistic reconstruction is repeated. with the third one (bottom), where the pose of the As a result, reconstructed terrain shape with esti- robot prohibits the lifting effect, and we therefore see mated variances is provided. The ATEM procedure the expected result. The second case (middle) shows is summarized in Algorithm 1. The rest of this sec- an example of a touch in one isolated point. tion provides detailed description of particular steps. 3.3. Robotic arm 4.1. Flipper exploration mode The UGV is equipped with a Kinova Jaco robotic arm1, see Fig. 1 left. It is a 6-DOF manipulator (with As soon as the ATEM is requested, the robot halts one extra DOF in each finger) capable of lifting 1.5 driving and adjusts angles of front flippers towards kg. For our approach, it is used for tactile exploration ground until they reach an obstacle or the ground. of surroundings up to cca. 50 cm around the robot. They keep pressing against it by defined torque while For the terrain sensing, robotic arm holds a tool with vector of proprioceptive measurements s is captured. 
a wooden stick—this setup protects its fingers from We measure: i) pitch of the robot (estimated from being broken when pushing against ground. It also IMU sensor), ii) angles of flippers, iii) currents in 1http://www.kinovarobotics.com/service- flipper engines, and iv) 6-dimensional output of the robotics/products/robot-arms force sensor. Variables: h - vector of terrain bin heights, distributions p(hI |hJ\I , s) of all missing heights hI . v - vector of height variances, Missing heights hI are reconstructed as the mean of s - vector of proprioceptive measurements. generated samples, variances v while ATEM is requested do I are estimated as the stop robot; variance of samples. // Invoke flipper exploration mode In the beginning, the missing heights hI are ran- // Section 4.1 domly initialized. The k-th sample hk is obtained while torque in front flippers < threshold do I by iterating over all unknown bins i ∈ I and gener- push flippers down; end ating their heights hk from conditional probabilities i s = capture proprioceptive measurements(); p(hi|hJ\i, s). The conditional probability is mod- // Perform kinematic reconstruction eled by Gaussian process [19, 13, 20] with a squared // Section 4.2 exponential kernel. [h, v] = kinematic reconstruction(s); To train the conditional probabilities, we collected // Perform probabilistic reconstr. real-world trajectories with i) sensor measurements // Section 4.3 [h, v] = probabilistic reconstruction(h, v, s); su and ii) corresponding terrain shapes hu estimated // Invoke arm exploration from the 3D laser map for u = 1 . . . U . The i-th // Section 4.4 conditional probability p(hi|hJ\i, s) is modeled by if any(v > threshold) then one Gaussian process learned from the training set [h, v] = arm exploration(h, v); {[(h1 , s1)>, h1], . . . , [(hU , sU )>, hU ]}. [h, v] = probabilistic reconstruction(h, v, s); J \i i J \i i end Modeling the bin height probabilities as normal move forward; distributions is a requirement laid by the Gaussian end process. However, it allows samples of the bin height that collide with the body of robot, which is of course Algorithm 1: Active tactile exploration mode for physically impossible. We propose to use Gaus- terrain shape reconstruction. sian distribution truncated by known kinematic con- straints, in which are samples constrained by the 4.2. Kinematic reconstruction maximal height that does not collide with the body of the robot. We discuss impact of this modification The terrain shape is modeled by Digital Eleva- in the Section 5. tion Map (DEM), which consists of eleven 0.1 m- wide bins. If there is only one isolated contact point 4.4. Active arm exploration sensed by the force sensor and the force surpasses We use the robotic arm to measure the height of experimentally identified threshold (see Fig. 4, sec- the terrain in bins the flippers cannot reach. The ond case), the height hi of the terrain in the corre- measurement taken by the robotic arm is reasonably sponding bin i is estimated by a geometric construc- accurate and precise but in its current state it takes tion from known robot kinematics, using the attitude about 45s to complete [17]. If the probabilistic recon- of the robot, configuration of joints and the position struction contains bins with variance v higher than of the contact point on the flipper. Variance vi for a user-defined threshold, the robotic arm is used to the corresponding force sensor is set to an experi- measure the height in the most uncertain bin, i.e. the mentally estimated value. 
The remaining hi and vi bin j = arg max values are set to non-numbers. i vi. The height sensed in the given bin is then fixed and the probabilistic reconstruction 4.3. Probabilistic reconstruction process is repeated. In the probabilistic reconstruction procedure, the 5. Experimental evaluation vector of heights h and the vector of variances v are estimated by the Gibbs sampling [18]. Let us de- In qualitative experiments, we focus on typical note the set of all bins J and the set of all bins in cases of terrain profile shapes and discuss perfor- which the reconstruction is needed by I (i.e. those mance of different settings of our algorithm. In quan- which height was not estimated in the kinematic re- titative experiments, we present performance statis- construction procedure or measured by the robotic tics over the whole testing dataset. arm). We use the Gibbs sampling to obtain height The training dataset consists of 28 runs contain- samples hk, k = 1 . . . K from the joint probability ing driving on flat terrain, approaching obstacles of I 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number Figure 5. From left: photo of the robot on a concrete ground; measured forces; terrain reconstruction, the gray polygon indicates position of the robot and its flippers, thin red line is the ground truth—flat ground in this instance. two different heights, traversing them and descend- fourth approach adds direct terrain measurement: we ing from them back to flat ground. Shape of obsta- simulate use of the robotic arm for measurements the cles selected for the dataset reflects the industrial ac- terrain height in bins with high uncertainty [17]. The cident scenario of the TRADR project - the environ- simulation means revealing the value of the bin cap- ment mostly consists of right-angle-shaped concrete tured in the ground truth, variance of the bin is then and steel objects. From the recorded runs, we have equal to the variance of the arm measurements. In extracted approximately 1400 individual terrain pro- the experiments shown in this paper we set the stan- file measurements for training. The whole training dard deviation threshold of Gibbs samples that leads dataset was recorded indoors on flat hard surfaces. to arm exploration to 0.06 m. The fourth approach is The testing dataset was recorded outdoors and com- called as PAFAc (pitch + angle of flippers + flipper bines uneven grass, stone and rough concrete sur- force sensor + robotic arm; constrained). faces. It contains more complex obstacles with vari- ous heights (different from those seen in the training 5.1. Qualitative Evaluation dataset). The testing dataset consists of more than In the figures 5, 6 and 7, we present typical ter-300 terrain profiles with the corresponding sensory rain profiles and robot actions: flat ground, two steps data. Ground truth necessary for training and test- with different height, climbing up a step and stepping ing was created manually by sampling scans from the down of a step. We compare performance of two al- laser range-finder recorded during the experiments. gorithms: i) PAc uses the kinematic constraints when We compare four different algorithms for terrain sampling but does not use the force sensors (light profile prediction. The baseline approach [8] uses blue line in the plots) ii) PAFc algorithm which uses only the IMU sensor and angles of flippers, we call the force sensors (green line and bars). The last two it PA (pitch + angle of flippers) for short. 
The sec- bars marked yellow in order to emphasize the predic- ond setup uses the same data and adds the probability tions are learnt from training dataset and we do not of terrain height being adapted in the way described have enough information to correct the predictions in Section 4.3. If the sampled height collides with from the sensing by flippers. the robot, the sample is set to the maximal possible We use mean of the (Gibbs) samples as the pre- height that is not in collision. The approach is called dicted value (connected by lines) and 0.1 and 0.9 PAc (pitch + angle of flippers; constrained). The quantiles for displaying dispersion of samples (error- third approach adds the flipper force sensor; mea- bars). The point (0, 0) coincides with the location of sured data are used in two ways. If the force mea- the IMU sensor inside the robot body. The depicted sured by a sensor element exceeds a threshold (ex- sketch of the robot: the pitch is estimated by IMU, perimentally set on 2 units), then the height of the flipper angle is directly measured. When the robot bin is computed from kinematics of the robot (pitch lies on a flat ground, Fig. 5, contact point is sensed and flipper angles and position of the sensor element) by the sixth element. The force measurement reduces and the bin is fixed and excluded from the Gibbs sam- uncertainty mainly in positions 0.3 − 0.7 m. pling step. It should be noted however, that the mea- Climbing up a step cases are depicted in Fig. 6. sured forces are used even if they are not bigger than The higher 0.28 m step obstacle is on top. The fifth the threshold – they are part of the proprioceptive sensor element measures the force that is bigger than data s. The approach is called as PAFc (pitch + angle threshold and the height in the bin 0.4 is fixed and of flippers + flipper force sensor; constrained). The not sampled. Note that algorithm PAc which does 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number Figure 6. Top: 0.28 m step, bottom: 0.2 cm step. Note the reduced uncertainty for the PAFc – green line and errorbars. The top photo of the robot is flipped in order to preserve left-to-right orientation which should ease the visual comparison. 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number 5 4 3 2 1 Sensor element output 0 1 2 3 4 5 6 Sensor element number Figure 7. Top: climbing up a step; Bottom: stepping down of a step. When stepping down, the robot “hangs” on the rear flippers, not the main flippers. 600 0.15 PAFc 0.14 PA PAc 0.13 500 PAc 0.12 PAFc 0.11 PAFAc 400 0.1 0.09 0.08 300 0.07 Frequency 0.06 200 0.05 0.04 Reconstruction error [m] 0.03 100 0.02 0.01 0 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 −0.3−0.2−0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 error [m] x [m] Figure 8. Quantitative evaluation of reconstruction quality Figure 9. Quantitative evaluation of terrain profile recon- in the places/bins that are under the flipper. struction – for all the DEM bins. Median, 1st quartile and 3rd quartile of errors are shown not use force sensor cannot predict the exact edge To evaluate our proposed solution experimentally, location. This fact is indicated by big dispersion of we designed and compared four algorithms—four samples in bins 0.3 and 0.4. The second situation possible approaches for proprioceptive terrain shape shown in Fig. 6 is the lower step. 
5.2. Quantitative Evaluation

As our metric of performance is the absolute error of the estimated bin heights, which is non-negative, we prefer to describe its statistical properties by quantile characteristics rather than by means and standard deviations. The statistics are computed from the whole testing dataset, i.e. from more than 300 outdoor terrain profiles.

First, we measure the direct effect of the force measurement on the accuracy of the height estimates. The graph in Fig. 8 shows the height error frequency of the DEM-bins that are underneath the front flipper. Note that the attribute “underneath the front flipper” is not fixed; it depends on the flipper angle. The force sensor indeed improves the accuracy over using the flipper angle only.

The second experiment studies the statistics for all the DEM-bins individually, see Fig. 9. Adding the kinematic constraint c naturally improves the estimates of the bins underneath the robot body (−0.3 . . . 0.2). Using the force sensors (PAFc) improves the height estimates of the DEM-bins underneath the front flipper (0.3 . . . 0.5). The bins in front of the flippers, i.e. 0.6 and 0.7, are directly measurable only by the arm exploration. It is thus obvious that including the measurement by the arm (PAFAc) has the dominant effect.

Figure 8. Quantitative evaluation of the reconstruction quality in the places/bins that are under the flipper.

Figure 9. Quantitative evaluation of the terrain profile reconstruction – for all the DEM bins. Median, 1st quartile and 3rd quartile of errors are shown.

6. Conclusions

In this paper the aim was to demonstrate a combined hardware and software solution that enhances the sensor suite and perception capabilities of our mobile robot intended for real Urban Search & Rescue missions. We focused our efforts on enabling proprioceptive terrain shape prediction for cases when vision and laser measurements are unavailable or deteriorated (such as in the presence of dense smoke). To evaluate our proposed solution experimentally, we designed and compared four algorithms—four possible approaches for proprioceptive terrain shape reconstruction: a simple kinematics-based approach, constrained kinematics, constrained kinematics with force sensors, and constrained kinematics with both force sensors and the robotic arm—the last intended for special cases where the terrain prediction reaches very high uncertainty. From the presented qualitative and quantitative experimental evaluation we can clearly see that enhancing the sensor suite with the force sensor array proves to be superior. The proposed algorithm, which combines Gaussian processes followed by Gibbs sampling, was successfully implemented on-board the robot to process the raw force measurements and perform the actual terrain shape prediction in a probabilistic manner. We certainly do not claim this is the only and best way to perform such terrain prediction, but it definitely serves as a sufficiently robust and accurate proof of concept for the intended deployment. As part of this concept, the integration of the robotic arm for active perception, in cases when the prediction based on the force sensors is too uncertain, proved to be important. For future work, we aim to embed additional force sensor arrays on all four robot flippers and extend the terrain prediction algorithm accordingly.

ACKNOWLEDGMENT

The authors were supported by the Czech Science Foundation GA14-13876S, the EU-FP7-ICT-609763 TRADR project and the Czech Technical University grant SGS15/081/OHK3/1T/13.

References
[1] I. Kruijff-Korbayová, F. Colas, M. Gianni, F. Pirri, J. de Greeff, K. V. Hindriks, M. A. Neerincx, P. Ögren, T. Svoboda, and R. Worst, “TRADR project: Long-term human-robot teaming for robot assisted disaster response,” KI, vol. 29, no. 2, pp. 193–201, 2015.
[2] L. Ojeda, J. Borenstein, G. Witus, and R. Karlsen, “Terrain characterization and classification with a mobile robot,” Journal of Field Robotics, vol. 23, no. 2, pp. 103–122, 2006.
[3] P. Dallaire, K. Walas, P. Giguere, and B. Chaib-draa, “Learning terrain types with the Pitman-Yor process mixtures of Gaussians for a legged robot,” in Intelligent Robots and Systems (IROS), 2015.
[4] J. Ahtiainen, T. Peynot, J. Saarinen, and S. Scheding, “Augmenting traversability maps with ultra-wideband radar to enhance obstacle detection in vegetated environments,” in Intelligent Robots and Systems (IROS), 2013.
[5] J. Sullivan, B. Mitchinson, M. Pearson, M. Evans, N. Lepora, C. Fox, C. Melhuish, and T. Prescott, “Tactile discrimination using active whisker sensors,” IEEE Sensors Journal, vol. 12, no. 2, pp. 350–362, 2012.
[6] M. Pearson, C. Fox, J. Sullivan, T. Prescott, T. Pipe, and B. Mitchinson, “Simultaneous localisation and mapping on a multi-degree of freedom biomimetic whiskered robot,” in Robotics and Automation (ICRA), 2013.
[7] K. Ohno, S. Morimura, S. Tadokoro, E. Koyanagi, and T. Yoshida, “Semi-autonomous control system of rescue crawler robot having flippers for getting over unknown-steps,” in Intelligent Robots and Systems (IROS), 2007.
[8] K. Zimmermann, P. Zuzanek, M. Reinstein, T. Petricek, and V. Hlavac, “Adaptive traversability of partially occluded obstacles,” in Robotics and Automation (ICRA), 2015.
[9] H. Lee, “Development of the robotic touch foot sensor for 2D walking robot, for studying rough terrain locomotion,” Master’s thesis, University of Kansas, Mechanical Engineering, June 2012.
[10] J. Shill, E. Collins, E. Coyle, and J. Clark, “Terrain identification on a one-legged hopping robot using high-resolution pressure images,” in Robotics and Automation (ICRA), 2014.
[11] D. Inoue, M. Konyo, K. Ohno, and S. Tadokoro, “Contact points detection for tracked mobile robots using inclination of track chains,” in International Conference on Advanced Intelligent Mechatronics, 2008, pp. 194–199.
[12] M. Bjorkman, Y. Bekiroglu, V. Hogman, and D. Kragic, “Enhancing visual perception of shape through tactile glances,” in Intelligent Robots and Systems (IROS), 2013.
[13] C. K. Williams and C. E. Rasmussen, “Gaussian processes for regression,” in Advances in Neural Information Processing Systems 8, D. Touretzky, M. Mozer, and M. Hasselmo, Eds. The MIT Press, 1996, pp. 514–520.
[14] M. Meier, M. Schöpfer, R. Haschke, and H. Ritter, “A probabilistic approach to tactile shape reconstruction,” Robotics, IEEE Transactions on, vol. 27, no. 3, pp. 630–635, 2011.
[15] J. Romano and K. Kuchenbecker, “Methods for robotic tool-mediated haptic surface recognition,” in Haptics Symposium (HAPTICS), 2014 IEEE, 2014, pp. 49–56.
[16] D. Xu, G. Loeb, and J. Fishel, “Tactile identification of objects using Bayesian exploration,” in Robotics and Automation (ICRA), 2013.
[17] V. Šalanský, “Contact terrain exploration for mobile robot,” Master’s thesis, Czech Technical University in Prague, 2015, in Czech.
[18] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 6, pp. 721–741, 1984.
[19] A. O’Hagan and J. Kingman, “Curve fitting and optimal design for prediction,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–42, 1978.
[20] C. E. Rasmussen and H. Nickisch, “GPML Matlab code.”
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Hessian Interest Points on GPU

Jaroslav Sloup, Michal Perd’och, Štěpán Obdržálek, Jiří Matas
Center for Machine Perception, Czech Technical University Prague
sloup|perdom1|xobdrzal|matas @fel.cvut.cz

Abstract. This paper is about interest point detection and GPU programming. We take a popular GPGPU implementation of SIFT – the de-facto standard in fast interest point detectors – SiftGPU, and implement modifications that according to recent research result in better performance in terms of repeatability of the detected points. The interest points found at local extrema of the Difference of Gaussians (DoG) function in the original SIFT are replaced by the local extrema of the determinant of the Hessian matrix of the intensity function.
Experimentally we show that the GPU implementation of the Hessian-based detector (i) surpasses in repeatability the original DoG-based implementation, (ii) gives results very close to those of a reference CPU implementation, and (iii) is significantly faster than the CPU implementation. We show what speedup is achieved for different image sizes and provide an analysis of the computational cost of individual steps of the algorithm.
The source code is publicly available.

1. Introduction

A viewpoint-independent representation of objects in images is one of the fundamental problems in computer vision. A popular approach is to extract a set of local measurements, known as descriptors, at a sparse set of image locations. These locations are called interest points and their purpose is (as opposed to dense image sampling) to reduce the spatial domain of further computation, hence reducing the cost to obtain, and the memory requirements to store, the image representation. It follows that for an interest point extraction process to be practical it needs to repeatedly identify the same points on the object surface when the viewpoint or the environment (e.g. illumination) change. Establishing correspondences between interest points representing an object in multiple images is a building step for a multitude of computer vision tasks, including stereo or multi-view reconstruction, object recognition, and image search and retrieval.

These are the desirable qualities for which interest point detectors are evaluated:
• Transformation Covariance. The detected points should correspondingly ‘follow’ the object as it is depicted from different viewpoints. This paper concerns similarity-covariant detectors which follow 2D image locations (objects at different positions in the image), scales (objects at different distances) and 2D orientations (in-plane rotation of the objects). Affine detectors, which additionally follow out-of-plane 3D rotations, are not considered here.
• Repeatability of detected interest points. The percentage of the points detected at corresponding image locations when the viewpoint changes.
• Accuracy with which the interest points are located and their scales and orientations are estimated.
• Coverage of various visually different classes of objects.
• Robustness under image degradation – noise, motion blur, compression, out of focus images, etc.
• Detection Speed, the computational cost of the interest point detection.
One of the most popular interest point detection algorithms is still the Scale-Invariant Feature Transform (SIFT) proposed by David Lowe [8] in 2004. It consistently ranks high on benchmarks in quality of detected points, but is computationally expensive, and therefore unsuitable e.g. for real-time video processing. Many speedier approximations and alternatives were proposed, e.g. SURF [2], FAST [11] and ORB [12], or CenSurE [1] and SUSurE [3], which can detect interest points significantly faster than SIFT, but often at the expense of repeatability and accuracy.

The only widely used detector that in most tests scores higher in repeatability than SIFT is the so-called Hessian detector. In SIFT, points are identified at local minima or maxima of the Difference of Gaussians function, thence in the presence of blob-like local image structures. In the Hessian detector, the points are located where the determinant of the Hessian matrix (a matrix of second-order partial derivatives) attains local extrema, which occur either for blob-like (local maxima) or for saddle-like structures (local minima). Experiments show that the extrema of the determinant of the Hessian are more repeatable and accurate than the extrema of the Difference of Gaussians, and, thanks to the additional detection of saddle points, the object coverage is generally also improved. The detection speed of the Hessian is similar to that of the SIFT.

Taking advantage of the recent widespread availability of programmable graphics cards, the execution time of many computer vision algorithms benefits if they are reimplemented for GPUs. Interest point detectors are no exception; a GPGPU (general-purpose GPU) SIFT implementation is available from [15, 14, 4]. The SIFTs are detected in real time for moderately sized videos or images on a consumer-grade GPU, therefore there is now a large group of applications for which it is no longer necessary to sacrifice detection quality for execution speed.

We build upon the available GPU SIFT implementation [15] and extend it with several contributions. The Difference of Gaussians is replaced with the determinant of the Hessian matrix as the function whose extrema indicate the presence of interest points. This improves the repeatability, and coverage, of the detected points, as is experimentally demonstrated below. Selection of the best K points (when ordered by the magnitude of the determinant) is implemented in an early stage of the algorithm. If only a specific number of points is requested, it is faster to decide which these are early, on the GPU, before orientations are determined and descriptors computed. Additionally, the feature type (saddle, dark or white blob) is now part of the GPU code output. This is useful in follow-up matching – features of different types should not be considered for a correspondence.

Some of the functionality that was available in the original CPU SIFT implementation and omitted in the GPU version was reintroduced. We add the optional capability to compute orientations and descriptors only in the ⟨0, π⟩ range instead of ⟨0, 2π⟩ by disregarding the sign of the gradients involved, which is beneficial when matching images taken under significantly different illumination (day and night). The restriction that at each image location at most two interest point orientations are detected was lifted. And the maximal number of iterations used for sub-pixel localization of a detected point is now configurable; the original SiftGPU code allowed only a single iteration.

In the rest of the paper we quickly describe the SIFT detector and explain the relations and differences between the Laplacian operator, the Difference of Gaussians and the determinant of the Hessian matrix (Section 2). In Section 3 we sketch the GPU implementation and analyze the computational cost of individual components. Experiments in Section 4 show that the Hessian indeed achieves better performance than the original SIFT and that the GPU and CPU implementations of the Hessian give very similar results.

2. Laplacian of Gaussian, Difference of Gaussians and Determinant of Hessian Matrix
Let us consider a grayscale image to be a discretized form of an underlying real-valued continuous function $f(x, y) : \mathbb{R}^2 \rightarrow \mathbb{R}$. Its Gaussian scale-space representation $L(x, y; t) : \mathbb{R}^3 \rightarrow \mathbb{R}$ is then defined as

$L(x, y; t) = g(x, y, t) \ast f(x, y)$,

where

$g(x, y, t) = \frac{1}{2\pi t} e^{-\frac{x^2 + y^2}{2t}}$

is a rotationally symmetric 2D Gaussian kernel parametrized by the variance $t = \sigma^2$, and where $\ast$ denotes convolution. Partial Gaussian derivatives of the image at a given scale t are then written as

$L_{x^\alpha y^\beta}(\cdot, \cdot; t) = \partial_{x^\alpha y^\beta} L(\cdot, \cdot; t) = (\partial_{x^\alpha y^\beta} g(\cdot, \cdot, t)) \ast f(\cdot, \cdot)$.

The Hessian matrix for a given t is a square matrix of second-order partial derivatives

$H = \begin{pmatrix} \frac{\partial^2 (f \ast g)}{\partial x^2} & \frac{\partial^2 (f \ast g)}{\partial x \, \partial y} \\ \frac{\partial^2 (f \ast g)}{\partial x \, \partial y} & \frac{\partial^2 (f \ast g)}{\partial y^2} \end{pmatrix} = \begin{pmatrix} L_{xx} & L_{xy} \\ L_{xy} & L_{yy} \end{pmatrix}$.

Let $\lambda_1$ and $\lambda_2$ denote the eigenvalues of the Hessian matrix. The Laplacian (or the Laplace operator, the sum of second partial derivatives) of the Gaussian is then $\nabla^2 L = L_{xx} + L_{yy} = \lambda_1 + \lambda_2$.

The Laplacian of Gaussian, appropriately normalized for different scales [7], is the basis for one of the first and also most common detectors of blob-like interest points. Local scale-space extrema are detected that are maxima/minima of $\nabla^2 L$ simultaneously with respect to both space (x, y) and scale t [5]. In the discrete domain, interest points are detected if the value of $\nabla^2 L$ at the point is greater/smaller than all values in its 26-neighbourhood. Locations of such points are covariant with translations, rotations and rescaling in the image domain. If a scale-space maximum is found at a point $(x_0, y_0; t_0)$, then after a rescaling of the image by a scale factor s there will be a corresponding scale-space maximum at $(s x_0, s y_0; s^2 t_0)$ [6].

The Laplacian of the Gaussian operator $\nabla^2 L(x, y; t)$ can be approximated [7] with a difference between two Gaussian-smoothed images at different scales t and $t + \Delta t$:

$\nabla^2 L(x, y; t) \approx \frac{t}{\Delta t} \left( L(x, y; t + \Delta t) - L(x, y; t) \right)$.

This approach is referred to as the Difference of Gaussians (DoG). In a fashion similar to the Laplacian detector, interest points are detected as extrema in the 3D scale-space. The Difference of Gaussians is used in the SIFT algorithm [8].

Another differential interest point detector is derived from the determinant of the Hessian matrix H,

$\det H_L(x, y; t) = L_{xx} L_{yy} - L_{xy}^2 = \lambda_1 \lambda_2$.

At image locations where the determinant is positive the image contains a blob-like structure. The Hessian matrix will there be either positive or negative definite, indicating the presence of either bright or dark blobs. If the determinant of the Hessian matrix is negative, the matrix is indefinite, which indicates a saddle-like interest point [5].

The determinant of the Hessian operator has better scale selection properties under affine image transformations than the Laplacian operator or its Difference-of-Gaussians approximation [7]. It was also shown to perform significantly better for image-based matching using local SIFT-like or SURF-like image descriptors, leading to higher efficiency and precision scores [7]. In an approximation computed from Haar wavelets it is the basis for the interest point detector in SURF [2].
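As a concrete illustration, the following is a minimal Python sketch of the determinant-of-Hessian response at a single scale, using SciPy's Gaussian derivative filters. The $t^2$ normalisation factor and the single-scale formulation are the author's reading of the standard scale-normalised operator; the full detector described in this paper additionally searches for extrema over a whole scale pyramid, which is omitted here.

    # Sketch: scale-normalised det-of-Hessian response for one scale t = sigma^2.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def det_hessian_response(image, sigma):
        f = image.astype(np.float64)
        Lxx = gaussian_filter(f, sigma, order=(0, 2))   # second derivative along x (columns)
        Lyy = gaussian_filter(f, sigma, order=(2, 0))   # second derivative along y (rows)
        Lxy = gaussian_filter(f, sigma, order=(1, 1))   # mixed derivative
        t = sigma ** 2
        return (t ** 2) * (Lxx * Lyy - Lxy ** 2)        # positive: blob, negative: saddle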
3. GPU Implementation and Computation Time Analysis

The GPU interest point implementation proceeds in the steps shown in Figure 1. First, the input image is loaded and transferred to a GPU texture. The scale pyramid data structures, which make up the majority of the GPU memory required, are allocated once at the beginning, and reallocated only in case a bigger image is eventually processed later. The allocation typically takes several hundreds of milliseconds. Initial image upscaling by a factor of two, which is sometimes used in feature detection, is not performed. The scale-space pyramid is then filled – a process that involves smoothing with Gaussian kernels with multiple standard deviations. Keypoints are detected as scale-space extrema of the determinant of the Hessian matrix and their locations are collected into a linear list. Optionally, the points are ordered by the response (the absolute value of the determinant) and only the top K points are kept for further processing. Keypoint orientations are then determined, with approximately 20% of the points ending up with two or more orientations assigned. The points, now with the orientations, are again collected to a list and SIFT descriptors are computed.

Figure 1. Block diagram of the computation pipeline: load image, (re)allocate pyramid, pyramid construction, keypoint detection, linear list of detected points, top K selection, keypoint orientations, multi-orientation linear list, descriptors. CPU code shown in yellow, GPU code in blue.

Figure 2 shows the execution speed measured on three GPU cards. The photo shown on the left, which represents a typical picture used in large-scale image retrieval tasks, was resized to eight different resolutions. Three CUDA-enabled graphics cards were tested: NVidia GeForce GT 730M (384 CUDA cores in 2 streaming multiprocessors, 1024MB DDR3 memory, 64-bit bus) is a representative of a common mobile/laptop GPU. NVidia GTX 750Ti (640 CUDA cores in 5 SMs, 2048MB GDDR5 memory, 128-bit bus) represents a gaming desktop card, and NVidia GTX Titan Black (2880 CUDA cores in 15 SMs, 6144MB GDDR5 memory, 384-bit bus) is a server card. Additionally, execution times of the reference CPU implementation running on a current desktop CPU (i7 4770) are reported.

Figure 2. Detection time for a test image at eight different resolutions. Three GPUs were measured, together with a reference CPU implementation. Total time (excluding image load and pyramid allocation), in ms:

               2592x1944  1920x1440  1600x1200  1280x960  1024x768  800x600  640x480  320x240
    GT 730M       231.13     142.53     108.01     78.91     60.67    46.57    36.06    20.59
    GTX 750Ti      64.08      43.25      35.67     27.57     24.02    19.05    17.15    11.09
    GTX Titan      38.50      29.55      24.81     20.75     18.53    16.67    14.95    10.78
    i7 4770       306.67     174.79     125.12     86.61     61.30    40.94    29.67     9.91
While the mobile GPU is only slightly faster than the CPU, the other two GPU cards are roughly five and eight times faster.

Figure 3 shows a breakdown of the load distribution over individual stages of the keypoint detection process (refer to Fig. 1). The analysis is shown for the desktop (left) and the mobile (right) GPUs. While the desktop card is about five times faster, the proportional distribution of the load is very similar.

Figure 3. Execution time of individual stages of the computation pipeline (refer to Fig. 1), evaluated at several image resolutions, with a default threshold on the detector response. The desktop GPU is about five times faster than the mobile GPU, but the relative load distribution between individual stages is virtually identical. Also the relation of the execution speed and image resolution is similar.

Comparing the execution speed of the original SiftGPU implementation (using the Difference of Gaussians) with our Hessian-based detector, see Fig. 4, we observe that the quality improvement demonstrated below in the Experiments comes at no additional computational cost.

Figure 4. Execution time of the original SiftGPU code, evaluated at several image resolutions. Compare to the timing of our Hessian-based detector on the same hardware (Fig. 3 left). The improved qualitative performance, demonstrated in Section 4, comes with a negligible computational cost.

Finally, in Figure 5 we show the timing when requesting only the best K keypoints. As expected, the stages preceding the top K selection are not affected. The stages following, orientation estimation and computation of the descriptor, take longer for more keypoints, although the increase is sub-linear until the GPU processing power is saturated at around 8000 descriptors computed in parallel.

Figure 5. Execution times when a limited number of K best points is requested. Computed on the full size 2592x1944 image without a threshold on the detector response. As expected, the processing time of the steps preceding the top K selection is not affected, while the later steps, most importantly the computation of the descriptor, scale with the number of points requested.

4. Experiments

The performance of the proposed GPGPU implementation of the determinant-of-Hessian detector was compared with other publicly available detectors, based on the Difference of Gaussians, the multiscale Laplacian and the determinant of the Hessian matrix. In particular, we have evaluated Lowe's [8] original version of SIFT and its VLFeat re-implementation, CPU implementations of the Hessian and the Laplacian, and the original GPU code of SiftGPU. The SURF detector [2], which is based on a fast approximation of the Hessian matrix, is also included. Two sets of experiments are presented: the first one evaluating the transformation invariance of the detectors in terms of repeatability and the number of correspondences, the second one evaluating the performance in a retrieval system.

4.1. Parameter Setting

One of the advantages of the determinant-of-Hessian based detector is in responding to an additional type of local features – saddle points [6]. In our initial experiments on a large set of images, we observed that the number of saddle points in natural images is about the same as the number of bright and dark blobs together. Therefore the Hessian gives roughly twice as many points as the Laplacian/DoG detectors, if detector configurations and thresholds are kept the same. To take advantage of these additional points while keeping the representations comparable in size for the experiments, the detected points in each image were ordered by the absolute response value of the detector and the best 1000, 2000 and 4000 points were selected for evaluation. Finally, to diminish the slight differences in the detection of the dominant orientation, the orientations were fixed to vertical in the retrieval experiment, and were not used in the detector repeatability experiment.
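A minimal sketch of the top-K selection by response magnitude, as used both in the GPU pipeline and in the experimental protocol above, could look as follows in Python. The array and list names are illustrative only; the GPU code performs the equivalent selection on-device.

    # Sketch: keep the K keypoints with the largest absolute detector response.
    import numpy as np

    def top_k_keypoints(keypoints, responses, k):
        responses = np.asarray(responses)
        if len(responses) <= k:
            return list(keypoints)
        idx = np.argpartition(-np.abs(responses), k - 1)[:k]   # O(n) partial selection
        return [keypoints[i] for i in idx]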
4.2. Datasets and Evaluation Protocols

A standard benchmark protocol and dataset for the evaluation of covariant interest point detectors was proposed by Mikolajczyk et al. [9]. It consists of eight sets, each of six images, with an increasing effect of image distortions: camera viewpoint, image scale, isotropic blur, underexposure and image compression. We have selected one scene with each distortion. Ground truth transformations are known, relating reference images of each set to all other images in that set. The transformations are used to compute repeatability scores by considering the overlap error of all pairs of detected points:

$E(R_{E_1}, R_{E_2}) = 1 - \frac{R_{E_1} \cap R_{H_{12}^\top E_2 H_{12}^{-1}}}{R_{E_1} \cup R_{H_{12}^\top E_2 H_{12}^{-1}}}$,

where $R_E$ represents the elliptic region defined by $x^\top E x = 1$ and $H_{12}$ is the known homography between the reference image 1 and the test image 2. To compensate for the different sizes of regions from different detectors, a scale factor is applied such that the region $R_{E_1}$ is transformed to a normalized size (equivalent to a radius of 30 pixels). Before evaluating the overlap error, the region $R_{E_2}$ is scaled using the same factor. The measurement region sizes – the radius of the SIFT index w.r.t. the detected scale of an interest point – were kept at their default values: 6.0 for the DoG detectors (Lowe, VLFeat), 5.2 for the Laplacian and the CPU and GPU Hessian. The reasoning behind this is that the DoG detectors return a slightly smaller (5-10%) intrinsic scale, determined by the smaller of the two subtracted Gaussians.
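For illustration, the overlap error of two elliptic regions can be approximated by rasterising both regions on a common grid, as sketched below in Python. This is a simplified stand-in for the reference protocol: the size normalisation to a 30-pixel radius and the mapping of the second ellipse through the ground-truth homography are assumed to have been applied already, and the grid resolution is an arbitrary choice.

    # Sketch: rasterised overlap error of two ellipses x^T E x <= 1.
    import numpy as np

    def ellipse_mask(E, center, xs, ys):
        dx, dy = xs - center[0], ys - center[1]
        return E[0, 0] * dx**2 + 2 * E[0, 1] * dx * dy + E[1, 1] * dy**2 <= 1.0

    def overlap_error(E1, c1, E2, c2, pad=60.0, step=0.25):
        xs, ys = np.meshgrid(np.arange(c1[0] - pad, c1[0] + pad, step),
                             np.arange(c1[1] - pad, c1[1] + pad, step))
        a = ellipse_mask(np.asarray(E1), c1, xs, ys)
        b = ellipse_mask(np.asarray(E2), c2, xs, ys)
        union = np.logical_or(a, b).sum()
        inter = np.logical_and(a, b).sum()
        return 1.0 - inter / union if union else 1.0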
4.3. Evaluation of Detector Repeatability

Repeatability is one of the important properties of interest point detectors. It is a measure that approximates the probability of the point redetection given the distortion between images. The detectors should be configured to provide comparable numbers of points to make the assessment fair. The repeatability score is complemented with the absolute number of corresponding points detected – the predicted upper bound of the matching problem. Figures 6, 7 and 8 show the measured scores when the number of detected points was limited to 1000, 2000, and 4000, respectively.

We observe that all the three DoG-based detectors (original Lowe's, from VLFeat and SiftGPU) perform virtually the same, as do the two Hessian-based detectors (CPU and GPU implementations). This strongly indicates that the measured performance is indeed inherent to the methods and not to a particular implementation. We also see that the Hessian performs in most cases better than the Laplacian and its DoG approximation. The additionally detected saddle points complement the blobs well and provide valuable correspondences between the images. With the exception of image blur, the fast but approximate SURF performs slightly worse than the other methods.

4.4. Evaluation in Image Retrieval

The repeatability of a detector predicts its pairwise matching potential. To assess the discrimination ability of a coupling of a detector (DoG, Laplacian, or Hessian) with a descriptor (SIFT), a large-scale image retrieval experiment was performed. The Oxford buildings dataset with about 5000 images was used. Each detector was again run in three configurations, requesting at most the 1000, 2000, resp. 4000 best interest points. The SIFT descriptor was computed from a local neighborhood around each point and stored. As there are no significant orientation changes in the dataset, the orientation of the interest points was fixed as vertical in this experiment.

The image retrieval performance was tested using the Oxford buildings dataset and the protocol defined by Philbin et al. [10]. In short, five queries are defined for each of eleven landmarks in Oxford, and a ground truth shortlist of positive examples is given. For each query an average precision (AP) is computed as the area below the precision-recall curve. Finally, a mean AP (mAP) is reported for the whole set of 55 query images.

A standard Bag of Words (BoW) approach with and without Spatial Verification (SV) was used [13, 10]. SIFT descriptors were quantized into three different vocabularies for each detector, with 500k visual words for 1000 points/image, and 1M visual words for 2000 resp. 4000 interest points per image. TF-IDF scoring in an efficient inverted index was used to obtain the BoW ranking. The spatial verification estimated a similarity transformation between the query and each of the top 1000 ranked images. Finally, images were re-ranked based on the number of correspondences. The ranking for each query was evaluated using the Oxford buildings protocol and a mean average precision computed as defined in [10].

The results are summarized in Table 1. The Hessian detectors consistently outperformed both the Laplacian and the Difference of Gaussians, regardless of the size of the representation. Particularly for the highest number of interest points per image (4000), where both DoG implementations were struggling to deliver this many points, their performance dropped. Thus we can conclude that the complementary saddle points detected by the Hessian detector consistently improve the retrieval performance.
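For reference, a simplified average-precision computation is sketched below in Python. It treats AP as the mean of the precision values at each relevant retrieved item; the official Oxford buildings evaluation additionally interpolates the precision-recall curve and distinguishes "junk" images, which is omitted here. The names ranked, positives and ground_truth are illustrative placeholders.

    # Sketch: AP per query and mAP over all queries (simplified protocol).
    import numpy as np

    def average_precision(ranked, positives):
        hits, precisions = 0, []
        for rank, img in enumerate(ranked, start=1):
            if img in positives:
                hits += 1
                precisions.append(hits / rank)       # precision at each recall step
        return float(np.mean(precisions)) if precisions else 0.0

    def mean_average_precision(rankings, ground_truth):
        return float(np.mean([average_precision(r, gt)
                              for r, gt in zip(rankings, ground_truth)]))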
Figure 6. Repeatability score and number of correspondences on image sequences with (from left to right): a significant view angle change, scale change, image blur and exposure change. The number of features per image was limited to the best 1000 according to the absolute response value.
Figure 7. Repeatability score and number of correspondences on image sequences with (from left to right): a significant view angle change, scale change, image blur and exposure change. The number of features per image was limited to the best 2000 according to the absolute response value.
Figure 8. Repeatability score and number of correspondences on image sequences with (from left to right): a significant view angle change, scale change, image blur and exposure change. The number of features per image was limited to the best 4000 according to the absolute response value.

    Method   Max.feat.  Lowe DoG  VLFeat DoG  CPU Laplacian  CPU Hessian  GPU Hessian
    BoW      1000       0.551     0.512       0.572          0.584        0.579
    BoW      2000       0.517     0.547       0.568          0.625        0.629
    BoW      4000       0.558     0.585       0.617          0.643        0.615
    BoW+SV   1000       0.590     0.554       0.601          0.627        0.621
    BoW+SV   2000       0.584     0.594       0.617          0.675        0.678
    BoW+SV   4000       0.639     0.650       0.692          0.716        0.699

Table 1. Image retrieval experiment. The Bag of Words (BoW) method with and without Spatial Verification (SV) was evaluated with different interest point implementations. Features were limited to the best 1000, 2000 resp. 4000 points per image based on the detector's response. The values in the table are the measured mean average precisions, defined in [10].

5. Conclusion

We have implemented an interest point detector based on the determinant of the Hessian matrix. Such a detector was previously shown, and the observation was confirmed in our experiments, to be superior in the quality of detected points to commonly used detectors based on the Difference of Gaussians. Starting with a publicly available GPU implementation of the SIFT detector, we have implemented several modifications and experimentally verified that the performance indeed improved. The implementation, which is in CUDA for compatible NVidia graphics cards, was published and made available.

Acknowledgements

The authors were supported by Toyota Motor Europe.

References

[1] M. Agrawal, K. Konolige, and M. R. Blas. CenSurE: Center surround extremas for realtime feature detection and matching. In D. A. Forsyth, P. H. S. Torr, and A. Zisserman, editors, ECCV (4), volume 5305 of Lecture Notes in Computer Science, pages 102–115. Springer, 2008.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding (CVIU), 110(3):346–359, June 2008.
[3] M. Ebrahimi and W. W. Mayol-Cuevas. SUSurE: Speeded Up Surround Extrema feature detector and descriptor for realtime applications. pages 9–14, Aug. 2009.
[4] H. Fassold and J. Rosner. A real-time GPU implementation of the SIFT algorithm for large-scale video analysis tasks. In IS&T/SPIE Electronic Imaging, pages 940007–940007. International Society for Optics and Photonics, 2015.
[5] T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer, 1994.
[6] T. Lindeberg. Feature detection with automatic scale selection. IJCV, 30(2):79–116, 1998.
[7] T. Lindeberg. Image matching using generalized scale-space interest points. Journal of Mathematical Imaging and Vision, 52(1):3–36, 2015.
[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision, 20(2):91–110, 2004.
[9] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. IJCV, 65(1-2):43–72, 2005.
[10] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[11] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In A. Leonardis, H. Bischof, and A. Pinz, editors, Computer Vision ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 430–443. Springer Berlin Heidelberg, 2006.
[12] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571, Nov 2011.
[13] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. volume 2, pages 1470–1477, 2003.
[14] M. Soltan Mohammadi and M. Rezaeian. SiftCU: An accelerated CUDA based implementation of SIFT. In Third Symposium on Computer Science and Software Engineering, Sharif University, Tehran, volume 3, 2013.
[15] C. Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/˜ccwu/siftgpu, 2007.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

BaCoN: Building a Classifier from only N Samples

Georg Waltner, Michael Opitz, Horst Bischof
Institute for Computer Graphics and Vision
Graz University of Technology, Austria
{waltner, opitz, bischof}@icg.tugraz.at

Abstract. We propose a model able to learn new object classes with a very limited amount of training samples (i.e. 1 to 5), while requiring near zero runtime cost for learning new object classes. After extracting Convolutional Neural Network (CNN) features, we discriminatively learn embeddings to separate the classes in feature space. The proposed method is especially useful for applications such as dish or logo recognition, where users typically add object classes comprising a wide variety of representations. Another benefit of our method is the low demand for computing power and memory, making it applicable for object classification on embedded devices.
We demonstrate on the Food-101 dataset that even one single training example is sufficient to recognize new object classes and considerably improves results over the probabilistic Nearest Class Means (NCM) formulation.

Table 1. One-shot learning results: The top row shows the used training sample (blue), the other rows are the first 6 results where our proposed method yields different results than the probabilistic NCM version. Green framing indicates our improved NCM version is correct, while red stands for the opposite. From left to right: bibimbap, creme brulee, hot dog, lobster roll sandwich, seaweed salad, spring rolls.

1. Introduction

With recent advances in object recognition [7], off-the-shelf features which are learned from a large number of annotated images have become freely available. As datasets grow, it will become increasingly computationally demanding to extend models built from these features to new object classes. Consider images of different food types - Japanese and European users will have quite different imaginations of an average lunch meal. Adapting a pretrained classifier to recognize meals that have not been seen during the training procedure is desirable. Similar to the cognitive capabilities of humans, who can learn new classes from only very few samples, we aim for a computer vision system where new classes can be added incrementally. In this work, we consider classification methods which can integrate previously unseen classes from a single (one-shot) or a small number (n-shot) of training samples. The main purpose of these methods is to recognize new object classes from a very limited number of training samples. This is especially useful for open-ended recognition scenarios, such as logo detection or food recognition, where the number of object classes steadily grows during the life cycle of an object recognition system. However, integrating new classes into a classifier pretrained on different classes is not straightforward. On the one hand a new class often exhibits large variations, on the other hand the classifier trained on the seen classes may not be capable of generalizing to new classes. Retraining state-of-the-art classifiers such as CNNs every time after such class additions leads to high accuracy, but is computationally inefficient and requires a significant amount of memory. For practical application this is often prohibitive, especially for embedded systems. An ideal system should therefore be able to integrate new object classes from very few seen samples on the fly, without the need for time- and memory-consuming retraining of the classifier and with negligible performance loss.
Attribute- based approaches have been used for animal catego- 2. Related Work rization and recognition [9] or for human-nameable visual attributes [11]. Similar to our idea of learning One approach in related works considering n-shot an optimal embedding, [6] statistically infers a Ma- settings is the use of Bayesian learning, where prob- halanobis distance metric on similar and dissimilar abilistic estimates are used to extend the algorithms feature pairs. In [16], a classifier is trained discrimi-to new classes. For example, [3] fit probabilistic natively for nearest prototype classification. Another density functions as category models and use them distance metric learning approach was presented in as prior knowledge for new classes, while using [15], where the authors propose Large Margin Near- one or more samples for generation of the poste- est Neighbor (LMNN) for classification. The LMNN rior model of the new class. Hierarchical Bayesian classifier learns a Mahalanobis metric, so that same models are used by [12], where super-categories are class samples are contracted and samples from differ- automatically discovered based on available classes ent classes are pushed apart from each other. These and serve as prior information to incorporate new methods are able to generalize to previously unseen classes. In [8], authors investigate one-shot learn- samples, but in contrast to our approach they do not ing of characters using Hierarchical Bayesian Pro- regard insertion of previously unseen object classes. where Nc is the number of samples x for class c, We employ a Nearest Class Mean (NCM) clas- f is a feature extraction function and θ are model- sifier [10, 14] for classification. Object classes in parameters. To predict the object class of a sample NCM classifiers are represented by the mean fea- we seek the minimum distance to all class means by ture vector of the corresponding class samples and computing can be easily extended to new classes by computing arg min kf (x; θ) − µ k , (2) the mean over newly added training samples. Our c 2 c=1,...,C method learns discriminative embeddings to better separate the classes in feature space. Other than the where C = |C| is the total number of classes. probabilistic NCM approach of [10], we use CNN 3.1. Feature Extraction features. We propose the hinge loss for optimization and show that this improves overall accuracy. Addi- Motivated by their recent success in image recog- tionally, we show how to robustify the learned NCM nition tasks, we utilize CNNs for feature extraction. embeddings. In contrast to other approaches for one- Instead of training a deep network from scratch, we shot and n-shot learning, our method does not need take a CNN model trained for the ImageNet Chal- access to the full dataset as we do not employ clas- lenge [7] and fine-tune on our task-specific training sifier retraining, enabling the use of our system for data. This can be seen as domain transfer from one embedded platforms like smartphones, where com- task to another and has proven to be superior to hand- puting power and storage is limited. Furthermore, the crafted features [4]. As the later layers of the net- most one-shot algorithms are Bayesian methods and work correspond to high-level features, we use the use prior knowledge from the training data to gen- last fully connected layer as 4096-dimensional fea- erate posterior probabilities for new classes. 
We do ture representation and normalize each feature vector not model such probabilities, but rely on the learned by dividing by its l2-norm. feature embeddings only. Figure 1 gives an overview 3.2. Embedding of our method: We use l2-normalized CNN features and learn additional layers that embed the features After fine-tuning the CNN, we employ several dis- in an optimal way. After that we add new classes tance metric learning methods to learn a discrimina- d×4096 to evaluate the incremental learning capability of our tive linear embedding matrix W ∈ R , with classifier. Table 1 shows one-shot learning results for d ∈ {1024, 4096}. This embedding projects samples some classes of the Food-101 dataset [2]. from the same object class next to each other in a high dimensional feature space, while simultanously 3. One-Shot and N-Shot Classification pushing samples from different object classes far away from each other. Using the embedding W, the In the n-shot classification setting the classifier ex- class prediction from Equation (2) becomes tends to new classes from a very limited number of samples (i.e. 1 to 5). NCM classifiers store a mean arg min kW · f (x; θ) − W · µ k c , (3) 2 vector for each object class they recognize. This has c=1,...,C the advantage that recognition of new classes can be In this work, we consider optimizing NCM loss func- incorporated by simply computing mean vectors for tions and the LMNN loss with respect to W to learn these classes. Mean vectors can be efficiently com- our embedding. In the remainder of this section, we puted online, eliminating the need of explicitly stor- will formally explain the different methods. ing feature vectors of all training samples that the NCM. As proposed in [10], embeddings for NCM class mean originates from. For one-shot learning, classifiers are usually learned by minimizing the neg- the class mean corresponds to one added class sam- ative log-likelihood. The posterior probability p(c|x) ple, for n-shot learning the mean is calculated from of class c given a sample x is defined as n samples of a new class. More formally, let µc be the mean vector for the c-th class from the set C of e−δ(x,µc;θ)2 p(c|x) = , (4) available classes, defined as PC e−δ(x,µi;θ)2 i=1 N where δ is defined as 1 c X µc = f (xi; θ), (1) Nc δ(x, µ; θ) = kW · f (x; θ) − W · µk . (5) i=1 2 To learn the embedding W, minimize the negative space. Let x be a data sample, y the class label and log-likelihood θ the parameters of the feature extraction function f . We propose to train a NCM layer on top of the CNN N 1 X features with the following NCM loss function L = − ln p(yi|xi) (6) N i=1 X L(x, y; θ) = λ · δ2y + max(0, 1 + δy −δc)2, (11) of sample xi and its corresponding class label yi. In c∈C\{y} subsequent sections we refer to this method as prob- where δ abilistic NCM (N CM y = δ(x, y; θ) and δc = δ(x, c; θ) are dis- P ), since we are optimizing a tance functions as defined in Equation (5) and λ is a negative log-likelihood function. weighting parameter. The first part enforces the sam- LMNN. The loss function of the LMNN embed- ples of one class to be embedded near the class mean ding [15] consists of two terms. One adds a penalty of the data sample, while the second term penalizes if for samples that share the same class label but exceed samples are within the margin of other class means. 
LMNN. The loss function of the LMNN embedding [15] consists of two terms. One adds a penalty for samples that share the same class label but exceed a certain distance (margin), while the other penalizes samples with different class labels that are close in feature space. The loss is calculated on triplets instead of pairs, where a sample is complemented by a sample of the same and a sample of a different class. The set of triplets is given by

$D = \{(i, j, k) : y_i = y_j, \; y_i \neq y_k\}$,   (7)

with $1 \leq i < j < k \leq N$; the LMNN loss function over the triplet set is then defined as

$L(D) = \sum_{(i,j,k) \in D} d_{ij}^2 + l_{ijk}$.   (8)

The distance function d of two samples $x_i$ and $x_j$ is

$d_{ij} = \| W \cdot f(x_i; \theta) - W \cdot f(x_j; \theta) \|_2$   (9)

and the triplet loss function $l_{ijk}$ is defined as

$l_{ijk} = \max(0, 1 + d_{ij} - d_{ik})$.   (10)

This embedding maximizes the distance in feature space between samples of different classes $(x_i, x_k)$, while concentrating samples that belong to the same class $(x_i, x_j)$. Following [13], during training we perform “hard” negative mining of triplets which violate the margin constraint imposed by $l_{ijk}$. Opposite to “soft” negatives, which do not violate the margin or violate it by only a small amount, “hard” negatives impose a high loss and therefore lead to faster training of the model and increased performance.

3.3. Large Margin Nearest Class Mean Classifiers

Inspired by LMNN, we propose a large margin loss function for NCM classifiers. Ideally, samples from the same class are close to their own mean vector and are far away from other mean vectors in feature space. Let x be a data sample, y the class label and $\theta$ the parameters of the feature extraction function f. We propose to train an NCM layer on top of the CNN features with the following NCM loss function

$L(x, y; \theta) = \lambda \cdot \delta_y^2 + \sum_{c \in \mathcal{C} \setminus \{y\}} \max(0, 1 + \delta_y - \delta_c)^2$,   (11)

where $\delta_y = \delta(x, y; \theta)$ and $\delta_c = \delta(x, c; \theta)$ are distance functions as defined in Equation (5) and $\lambda$ is a weighting parameter. The first part enforces the samples of one class to be embedded near the class mean of the data sample, while the second term penalizes samples that fall within the margin of other class means. This large margin version of the NCM classifier will be referred to as $NCM_{LM}$.

3.4. Robust NCM

Due to variations in shape, illumination and appearance, feature vectors from an object class usually exhibit intra-class variance. We model this uncertainty by assuming that a feature vector for a sample x associated with class c is generated by a normal distribution $\mathcal{N}(\mu_c, \sigma_c)$. We incorporate this variation in our model by computing the standard deviation $\sigma_c$ for all classes $c \in \mathcal{C}$ over the training set. During optimization we add random noise to our feature vectors to account for this uncertainty. More formally, the loss function we minimize is

$L(x, y; \theta) = \lambda \cdot \hat{\delta}_y^2 + \sum_{c \in \mathcal{C} \setminus \{y\}} \max(0, 1 + \hat{\delta}_y - \hat{\delta}_c)^2$,   (12)

where

$\hat{\delta}(x, \mu; \theta) = \| W \cdot \hat{f}(x; \theta) - W \cdot \mu \|_2$   (13)

and

$\hat{f} = f(x) + \Sigma_y^{\frac{1}{2}} \epsilon \cdot \gamma$.   (14)

$\Sigma_y$ is the diagonal covariance matrix of class y, $\epsilon \in \mathbb{R}^{4096}$, $\epsilon \sim \mathcal{N}(0, 1)$, is a random vector drawn from a normal distribution, and $\gamma$ is a hyper-parameter which defines the impact of the distortions. In our experiments we fix $\lambda$ to 0.01 and $\gamma$ is set to 0.5. During training we first compute the standard deviation of each feature per object class. We then add the noise to our feature vectors, to make the embedding W more robust against inter-class variations. This robustification is done in real time during training and can be seen as data augmentation, making the impact of outliers on the means smaller. We refer to this method as $NCM_{LM\text{-}R}$.
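The large-margin NCM loss (Eq. (11)) and its robust variant (Eqs. (12)-(14)) can be sketched for a single sample as follows in Python. This is an autograd-free forward evaluation for illustration only; shapes, the random generator and the optional class_std argument are assumptions, and the paper's actual training is performed with SGD in Theano.

    # Sketch of the NCM_LM / NCM_LM-R loss for one sample.
    import numpy as np

    def ncm_lm_loss(W, feat, y, means, class_std=None, lam=0.01, gamma=0.5,
                    rng=np.random.default_rng(0)):
        if class_std is not None:                               # robust version, Eq. (14)
            feat = feat + class_std[y] * rng.standard_normal(feat.shape) * gamma
        z = W @ feat                                            # embedded sample
        d = np.linalg.norm(z - means @ W.T, axis=1)             # delta(x, mu_c) for all classes
        d_y = d[y]
        others = np.delete(d, y)
        hinge = np.maximum(0.0, 1.0 + d_y - others) ** 2        # margin term over c != y
        return lam * d_y ** 2 + hinge.sum()                     # Eq. (11) / Eq. (12)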
4. Experiments

For evaluation of our method we use the publicly available Food-101 dataset [2]. It consists of 101 food classes with 1000 images per class. The images were taken in real-world environments, exhibiting a lot of variation in illumination conditions or food arrangement (see Figure 2 for some examples), and are well suited for the targeted application case where users add data continuously.

Following the protocol in [2], we randomly split the 1000 samples of each class into 750 for training and 250 for testing. For training of the CNN, we then apply an 80%/20% split for training and validation (600 and 150 samples, respectively). This results in a training, validation and test set with 60,600 (60%), 15,150 (15%) and 25,250 (25%) samples, respectively. Further, from the 101 classes we randomly select 50 training classes on which we train our classifiers and 51 classes on which we evaluate the generalization capability of our method to novel classes. For the sake of completeness, we also evaluate the embeddings on the 50 training classes only.

For fine-tuning CaffeNet on the Food-101 dataset, we train our network with Stochastic Gradient Descent (SGD) and momentum. We follow standard fine-tuning protocols [7] and use a low initial learning rate of 0.001 and a momentum of 0.9. We anneal the learning rate by a factor of 10 after every 20,000 iterations. To determine convergence, we measure the accuracy on a validation set after every 500 gradient updates.

We optimize our embeddings with SGD and momentum. For training the embeddings, when not otherwise stated, we fix the weights of the CNN and train just the last embedding layer. This allows us to use large learning rates of 0.25-0.5 with a momentum term of 0.9. Further, we use large minibatch sizes of 1024 and train for about 20 epochs. We exponentially anneal the learning rate at epochs 15 and 18. To determine convergence, we measure the accuracy on our validation set after each training epoch.

In our experiments we use Caffe [5] for fine-tuning, while the evaluations of the embedding methods are implemented in Python utilizing the Theano library [1].

Figure 2. First 8 samples of randomly chosen classes from the Food-101 dataset [2]. From top to bottom: baklava, beef carpaccio, chicken curry, chocolate mousse, fried rice, gnocchi, miso soup, panna cotta, scallops, tacos.

4.1. Experiments with Known Classes

To obtain feature representations and a softmax baseline, we fine-tune the pretrained ImageNet CaffeNet model from [7] on the 50 training classes from the Food-101 dataset as described above. In the following, we compare our methods to the softmax classifier (CNN_softmax) and to a probabilistic NCM (NCM_P) version related to the work of [10]. The first results are obtained by nearest class mean classification, using Euclidean (CNN_euc) and cosine (CNN_cos) distance measures between the class means of all training samples and the test samples. Subsequently, we train our NCM and LMNN embedding layers on top of the fine-tuned network while leaving the net weights fixed (NCM_LM, NCM_LM-R, LMNN). A summary of the results is depicted in Table 2.

Interestingly, the nearest class mean classification performs better than the probabilistic version of NCM, implying that the CNN features already separate the classes well. Our robust NCM version improves results over the probabilistic version by about 2% and is very competitive in comparison to the end-to-end trained softmax classifier of the network. The NCM embedding trained on the hinge loss and the LMNN embedding also reach comparable accuracy.
Method        Emb.   Accuracy
CNN_euc       −      68.60
CNN_cos       −      68.64
NCM_P [10]    1024   67.66
NCM_P [10]    4096   67.75
NCM_LM        1024   69.00
NCM_LM        4096   69.14
NCM_LM-R      1024   69.68
NCM_LM-R      4096   69.61
LMNN          1024   69.20
LMNN          4096   69.11
CNN_softmax   −      70.26

Table 2. Classification results on the 50 classes used for fine-tuning the CNN model for feature extraction. Our proposed robust NCM version reaches almost the same accuracy as the end-to-end trained softmax classifier while improving the results over the standard probabilistic NCM classifier by 2%. Best (bold) and second best (underlined) embeddings are marked.

4.2. Introducing Unseen Classes

To assess how our method generalizes to new classes from only a limited number of samples, we use n random samples from the training set of the remaining 51 classes to compute the mean vectors from the output of the embeddings. The embeddings and the CNN remain fixed and are not retrained; hence the addition of new classes reduces to storing the new class means. We choose n ∈ {1, 5, 10, 20, 50, 100} and report the accuracy on the full Food-101 test set, where every class is represented by 250 samples. Since for small values of n the results might have a large standard deviation, we repeat these experiments 100 times using different training samples to compute the class means that represent the new classes.

Table 3 shows that fine-tuning the network in the training phase (known classes) with metric learning methods generally improves accuracy in the testing phase for smaller values of n. Training the CNN with Caffe on the full dataset of 101 classes converges after approximately 100,000 iterations to 66.63%. We also trained two more standard classifiers on the CNN features, namely an SVM and LDA. It is remarkable that although the SVM has access to the full dataset, its performance compared to our proposed methods is inferior for n ∈ {5, 10, 20}. The same applies to the LDA classifier, where only a large number of new samples achieves a performance improvement compared to our proposed methods.

Method     Emb.   n = 1          n = 5          n = 10         n = 20         n = 50         n = 100
CNN_euc    −      44.15 ± 0.08   49.30 ± 0.31   54.26 ± 0.30   57.66 ± 0.20   60.03 ± 0.15   60.83 ± 0.13
CNN_cos    −      44.92 ± 0.24   49.82 ± 0.34   53.92 ± 0.32   57.26 ± 0.21   59.86 ± 0.16   60.76 ± 0.13
LDA        −      44.51 ± 0.01   44.92 ± 0.12   45.85 ± 0.12   48.44 ± 0.18   57.87 ± 0.17   63.61 ± 0.16
NCM_P      1024   45.55 ± 0.33   50.11 ± 0.32   51.89 ± 0.25   53.08 ± 0.20   54.03 ± 0.13   54.39 ± 0.09
NCM_P      4096   45.62 ± 0.34   50.23 ± 0.33   52.03 ± 0.25   53.23 ± 0.20   54.20 ± 0.14   54.57 ± 0.09
NCM_LM     1024   46.23 ± 0.32   51.46 ± 0.34   53.49 ± 0.26   54.88 ± 0.18   56.02 ± 0.15   56.44 ± 0.11
NCM_LM     4096   45.97 ± 0.28   51.43 ± 0.34   53.51 ± 0.27   54.93 ± 0.19   56.06 ± 0.13   56.50 ± 0.11
NCM_LM-R   1024   46.30 ± 0.32   51.95 ± 0.35   54.15 ± 0.26   55.67 ± 0.21   56.89 ± 0.15   57.37 ± 0.11
NCM_LM-R   4096   46.28 ± 0.32   51.94 ± 0.35   54.13 ± 0.28   55.66 ± 0.21   56.85 ± 0.15   57.31 ± 0.11
LMNN       1024   45.78 ± 0.25   51.60 ± 0.32   53.82 ± 0.28   55.34 ± 0.21   56.57 ± 0.16   57.05 ± 0.12
LMNN       4096   45.29 ± 0.18   51.58 ± 0.33   54.10 ± 0.28   55.81 ± 0.21   57.14 ± 0.14   57.63 ± 0.11
SVM        −      46.52 ± 0.40   50.02 ± 0.35   52.25 ± 0.30   55.08 ± 0.24   59.74 ± 0.19   63.38 ± 0.17

Table 3. Classification accuracy over the full Food-101 test set (250 samples per class) after adding n ∈ {1, 5, 10, 20, 50, 100} training samples for each of the 51 test classes. Accuracy and standard deviation are calculated over 100 runs. The baseline accuracy for end-to-end training of the CNN on all classes with all available data is 66.63%. The best and the second best result in each column are shown in bold and underlined, respectively.
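Adding an unseen class therefore amounts to little more than the following sketch; the function and variable names are illustrative and not taken from the paper:

```python
import numpy as np

def add_class(embedded_means, new_samples, W):
    """Add a new class by storing the embedded mean of its n feature vectors
    (sketch of the procedure in Sec. 4.2; CNN and embedding W stay fixed)."""
    mu = np.mean(new_samples, axis=0)     # Eq. (1) over the n new feature vectors
    embedded_means.append(W @ mu)
    return embedded_means

def classify(feature, embedded_means, W):
    """Nearest class mean prediction in the embedded space (Eq. 3)."""
    e = W @ feature
    return int(np.argmin([np.linalg.norm(e - m) for m in embedded_means]))
```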
5. Conclusion

We introduced embedding methods for one-shot and n-shot object class recognition. Our proposed extensions to NCM classifiers consistently improve the accuracy over the standard NCM training formulation in a scenario where the number of classes to be recognized by the classifier doubles. Our methods perform best for settings where only very few new samples (n ≤ 10) per class are available. The extension of the classifier to new object classes is independent of the old training data and is efficient in terms of computational expense and memory. This is especially useful for recognition systems running on embedded devices, where CPU power and memory are limited.

Acknowledgements

This work was supported by the Austrian Research Promotion Agency (FFG) under the projects MANGO (836488) and DIANGO (840824).

References
[1] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Scientific Computing with Python Conference, June 2010.
[2] L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 – Mining Discriminative Components with Random Forests. In European Conference on Computer Vision, 2014.
[3] L. Fei-Fei, R. Fergus, and P. Perona. One-Shot Learning of Object Categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.
[6] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof. Joint Learning of Discriminative Prototypes and Large Margin Nearest Neighbor Classifiers. In IEEE International Conference on Computer Vision, 2013.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, 2012.
[8] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-Shot Learning by Inverting a Compositional Causal Process. In Advances in Neural Information Processing Systems, 2013.
[9] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to Detect Unseen Object Classes by Between-class Attribute Transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[10] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-Based Image Classification: Generalizing to New Classes at Near-Zero Cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624–2637, 2013.
[11] D. Parikh and K. Grauman. Relative Attributes. In IEEE International Conference on Computer Vision, 2011.
[12] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. One-Shot Learning with a Hierarchical Nonparametric Bayesian Model. In Workshop on Unsupervised and Transfer Learning in conjunction with the International Conference on Machine Learning, 2012.
[13] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[14] A. R. Webb and K. D. Copsey. Statistical Pattern Recognition. Wiley, 3rd edition, 2011.
[15] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Advances in Neural Information Processing Systems, 2005.
[16] P. Wohlhart, M. Köstinger, M. Donoser, P. M. Roth, and H. Bischof. Optimizing 1-Nearest Prototype Classifiers. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Cuneiform Detection in Vectorized Raster Images

Judith Massa¹, Bartosz Bogacz¹, Susanne Krömker² and Hubert Mara¹
Interdisciplinary Center for Scientific Computing (IWR)
¹Forensic Computational Geometry Laboratory (FCGL)
²Visualization and Numerical Geometry (NGG)
Heidelberg University, Germany
{judith.massa|bartosz.bogacz|susanne.kroemker|hubert.mara}@iwr.uni-heidelberg.de

Abstract. Documents written in cuneiform script are one of the largest sources about ancient history. The script is written by imprinting wedges (Latin: cunei) into clay tablets and was used for almost four millennia. This three-dimensional script is typically transcribed by hand with ink on paper. These transcriptions are available in large quantities as raster graphics from online sources like the Cuneiform Digital Library Initiative (CDLI). Within this article we present an approach to extract Scalable Vector Graphics (SVG) in 2D from raster images, as we previously did from 3D models. This enlarges our basis of data sets for tasks like word-spotting. In the first step of vectorizing the raster images we extract smooth outlines and a minimal graph representation of sets of wedges, i.e., the main components of cuneiform characters. Then we discretize these outlines, followed by a Delaunay triangulation to extract skeletons of sets of connected wedges. To separate the sets into single wedges we experimented with different conflict resolution strategies and candidate pruning. A thorough evaluation of our methods and their parameters on real-world data shows that the wedges are extracted with a true positive rate of 0.98. At the same time the false positive rate is 0.2, which calls for a future extension using statistics about geometric configurations of wedge sets.

Figure 1: A tracing of the tablet VAT6546 [23].

1. Introduction

Documents were written in cuneiform script for more than three millennia in the ancient Middle East [26]. Cuneiform characters were typically written on clay tablets by imprinting a rectangular stylus and leaving a wedge-shaped (cuneus in Latin) trace, i.e., triangular markings. As clay was always cheaply and easily available, everybody capable of writing could produce robust documents. Therefore, the content of cuneiform tablets ranges from simple shopping lists to treaties between empires. The number of known tablets is assumed to be in the hundreds of thousands, and it is constantly increasing as new tablets are excavated by archaeologists on a regular basis. By roughly estimating the number of words on those tablets, we can assume that the total amount of text in cuneiform script is comparable to that in Latin or Ancient Greek.

Since 1999, a number of projects have been launched to facilitate the work of Assyriologists. The Digital Hammurabi Project is concerned with the digitization of cuneiform tablets [27]. Achievements of the project include the creation of high-resolution 3D models [17] as well as 3D and 2D visualization techniques for the models.
Similarly, projects in Leuven deal with the efficient production of 3D models of tablets [28] and techniques to visualize the models [13]. The Cuneiform Digital Library Initiative [15] incorporates a number of projects aimed at cataloging cuneiform documents and making them available online as transliteration, tracing and 2D image. In [11], the software framework CuneiformAnalyzer is introduced. It assists researchers in script analysis by detecting and segmenting wedge impressions in 3D models [10]. Furthermore, the program simplifies collation of fragments and reconstruction of tablets with methods from 3D computer graphics. The GigaMesh project contributes visualization methods and the extraction of cuneiform characters from tablets [21, 20].

Our method extracts wedge-shaped impressions from raster images. These images are hand-drawn transcriptions of cuneiform tablets of varying quality and with two different styles of marking wedges. We vectorize images of the transcriptions and match patterns and shapes to detect these constellations of wedges in the vectorized transcriptions.

2. Related Work

In [7] the problem of content-based image retrieval of Scalable Vector Graphics (SVG) documents is tackled. Their approach uses a description language to simplify comparisons between shapes. It represents an object by a basic shape, like a unit circle, and a transformation entailing its scale and translation from the origin. The resulting framework handles composites of simple SVG shapes, but no SVG path elements, which are able to represent arbitrary shapes. The chosen similarity measure is a weighted sum of shape, color, transformation, spatial and position similarity.

In [18] the problem of hierarchically clustering shapes described as vector graphics is addressed. Based on [29], Kuntz uses Fourier descriptors [6] to describe and compare single basic SVG shapes. The descriptors serve as feature vectors which are then clustered using state-of-the-art clustering algorithms. We cannot apply Kuntz' method since it does not deal with shapes that are part of a compound described as one object. Yet, our input consists of SVG path elements that usually describe such compound shapes (Figure 2).

Figure 2: (a) Paths described by four distinct wedge shapes and (b) the path described by the compound shape formed by them.

3. Implementation

Our implementation proceeds in five distinct steps. The first three steps transform raster image transcriptions into a set of skeletonized wedge constellations. The final two steps extract and prune wedge candidates from these constellations.

3.1. Vectorization

For the vectorization step, Selinger's potrace algorithm¹ is used. A directed graph G1 is constructed by traveling along the edges between black and white pixels. Thereby, each vertex v in the graph corresponds to a pixel corner which is adjacent to four pixels in the bitmap image, of which at least one has to be black and one has to be white. An edge (v_i, v_{i+1}) between two vertices is created if the corresponding corners are neighbors in the bitmap image and the edge separates a black and a white pixel. A path p = {v_0, ..., v_n} is then a sequence of vertices where there is an edge between each pair of consecutive vertices v_i and v_{i+1} for i = 0, ..., n − 1. A path is called closed if v_0 = v_n. Whenever a closed path is found, the color of the pixels enclosed by it is inverted. The algorithm is applied recursively to the new image until there are no black pixels left.

For each of the resulting paths a polygon is calculated. To this end another directed graph G2 is constructed, where each edge represents a straight path and the vertex set of G2 is the subset of the vertices of G1 reduced to the endpoints of the straight paths.
A path p = {v_0, ..., v_n} is called straight if for all index triples (i, j, k) with 0 ≤ i < j < k ≤ n there exists a point w on the straight line through v_i and v_k such that d(v_j, w) ≤ 1, where d denotes the Euclidean norm. Furthermore, not all four possible vertex-to-vertex directions v_{i+1} − v_i may occur in the path (Figure 3).

Figure 3: Shows how potrace checks if paths are straight. The dots represent the vertices of the paths and the squares the 1/2-neighborhoods of the vertices. Paths in (a), (b) and (d) are straight; (c) and (e) are not.

Each edge is then assigned a penalty P_{i,j} for using the corresponding straight path in the resulting polygon. The penalty is the product of the Euclidean length and the standard deviation of the vertex distances:

    D_{i,j} = \sum_{k=i}^{j} dist(v_k, \overline{v_i v_j})^2,   (1)

    P_{i,j} = |v_i − v_j| · \sqrt{\frac{1}{j ⊖ i + 1} D_{i,j}},   (2)

with j ⊖ i = j − i if i ≤ j and j ⊖ i = j − i + n if j ≤ i, and dist(a, \overline{cd}) the Euclidean distance of a point to a straight segment. Finding an optimal polygon is then equivalent to finding an optimal cycle in graph G2, with the quality measured by the tuple (k, P), where k is the number of straight paths that make up the cycle and P is the sum of the respective penalties. With that, a polygon with a smaller penalty but more segments is considered worse than a polygon with fewer segments but a higher penalty.

After choosing a polygon, Bézier curves are calculated, which smooths the corners where it seems reasonable. Optionally, consecutive curves are joined if the segments agree in convexity and the total direction change does not exceed 89 degrees.

¹http://potrace.sourceforge.net. Project page of potrace. Last visited on 4/11/15.
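For illustration, the penalty of Equations (1) and (2) for a non-wrapping sub-path might be computed as in the following sketch; it measures distances to the supporting line of the segment v_i v_j, which is an assumption on our part rather than the exact potrace implementation:

```python
import numpy as np

def segment_penalty(path, i, j):
    """Penalty P_ij for approximating path[i..j] by one segment (sketch of Eqs. 1-2,
    assuming i <= j, i.e. no cyclic wrap-around)."""
    vi, vj = np.asarray(path[i], float), np.asarray(path[j], float)
    direction = vj - vi
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    d2 = 0.0
    for k in range(i, j + 1):
        off = np.asarray(path[k], float) - vi
        d2 += (off[0] * unit[1] - off[1] * unit[0]) ** 2   # squared perpendicular distance
    return length * np.sqrt(d2 / (j - i + 1))              # Eq. (2)
```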
3.2. Discretization

We tested minimizing the maximum distance between the polygon line segments and the contour segments, but found that even if a polygon approximates an arbitrary shape well, it is still not assured that the resulting Voronoi skeleton will be a good approximation of the skeleton of the wedge constellation. The key to a good discrete skeleton turned out to be limiting the distance of two sample points along the shape outline. However, calculating the distance along a Bézier curve B(t) is a complex task [12], since the length s of the complete curve is

    s = \int_0^1 \sqrt{B'_x(t)^2 + B'_y(t)^2}\, dt,   (3)

which has no closed-form solution. Yet, we know that the curve length s of a Bézier curve of degree 3 has the sum of the distances of consecutive control points C_i as an upper bound:

    s ≤ \sum_{i=0}^{2} \| C_{i+1} − C_i \|.   (4)

Since smaller distances along the path can only improve the quality of the resulting skeleton, this upper bound is used for discretizing the silhouette.

3.3. Skeletonization

In order to deal with occluding wedge marks in the detection step, shape skeletons are used as intermediate representations. Different definitions of shape skeletons have been stated [2, 22, 25, 19]. Based on [25], we define a shape skeleton as the infinite set of points within the shape boundaries that have more than one closest point on the shape outline. The skeleton can be computed efficiently with time complexity O(n log n) by using Voronoi diagrams [16]. The Voronoi diagram for a set of sites S divides a space into |S| partitions called Voronoi regions. In R², a Voronoi region is the interior of a convex polygon whose boundaries, the Voronoi edges, are equidistant to two of the input sites. As input sites, the polygon vertices obtained in the previous discretization step are used.

The Voronoi diagram (Figure 4b) is computed by solving the dual problem first: the Delaunay triangulation (Figure 4a). Each Voronoi vertex represents the circumcenter of a Delaunay facet and a Voronoi ridge connects two such points of neighboring facets. An implementation of the quickhull algorithm [1] is used to calculate the 2-dimensional Delaunay triangulation from a 3-dimensional convex hull.

The Voronoi ridges with end points outside the original shape boundaries and ridges crossing the contour are removed. The skeleton, represented as an undirected graph (Figure 4c), consists of more and in general shorter segments. These, in turn, are made up of longer segments the more vertices form the approximated polygon. Since the short segments are rarely meaningful, considering their directions, a new skeleton (Figure 4d) is constructed. The computation is done by a graph traversal algorithm: it follows a series of consecutive edges until an end node is incident to more than two edges in the original skeleton.

Figure 4: Visualization of important steps of the skeleton computation and simplification process: (a) the Delaunay triangulation, (b) the Voronoi diagram, (c) the inner elements of the Voronoi diagram and (d) the resulting skeleton.
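A minimal sketch of this construction using SciPy (which also relies on Qhull internally, although not via the 3D-convex-hull route described above) could look as follows; the pruning of ridges that cross the contour is omitted here:

```python
import numpy as np
from scipy.spatial import Voronoi
from matplotlib.path import Path

def voronoi_skeleton(outline):
    """Voronoi-based skeleton of a discretized outline (sketch of Sec. 3.3).

    outline : (n, 2) array of discretization points along the shape outline.
    Returns skeleton edges as pairs of points; ridges with an endpoint outside
    the shape boundary are discarded.
    """
    vor = Voronoi(outline)
    inside = Path(outline)                      # point-in-polygon test on the outline
    edges = []
    for ridge in vor.ridge_vertices:
        if -1 in ridge:                         # unbounded ridge, skip
            continue
        p, q = vor.vertices[ridge[0]], vor.vertices[ridge[1]]
        if inside.contains_point(p) and inside.contains_point(q):
            edges.append((tuple(p), tuple(q)))  # keep inner Voronoi ridges only
    return edges
```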
3.4. Extraction

The basic shape of a wedge impression can be described by a Y- or T-junction. We call this junction the wedge-head and the ridges extending from the junction the wedge-arms.

After having computed a shape skeleton, the detection of the wedge-heads of the impressions can be approached. There are two different ways a wedge impression can be drawn: as contour lines or as shapes filled with ink. For a single wedge impression, the filled shape results in a single closed curve after bitmap tracing, while the shape contour is represented as two closed curves, where only the area between both curves is filled with color. Usually, the representation as unfilled shape contour is intended, but for small wedges the thickness of the pencil used for the original ink tracing sometimes leads to solid shapes. The two representations result in two different skeletons, as shown in Figure 5. These two cases are considered separately and certainty values w_loc are calculated for locations in the skeleton graph that seem likely to contain a wedge-head. In both cases, w_loc ranges from zero to one, with values close to one indicating a high probability of a wedge-head at the considered location. The result of this step is a set of wedge-heads for which the certainty value exceeds the threshold t_loc^contour for contour wedges or t_loc^solid for solid wedges.

Figure 5: Two different ways of representing a wedge impression in ink tracings: (a) as unfilled contour and (b) as filled shape. The result of the bitmap trace is drawn in red, the simplified Voronoi skeleton in blue.

Wedge-Head Detection of Shape Contours. Having a wedge impression represented as a contour, the respective skeleton graph shows a cycle resembling a triangle at the position of the wedge-head (Figure 5a). This fact is used to locate the wedge-head of a wedge contour. A cycle in an undirected graph is an ordered set of vertices

    C = (v_0, v_1, ..., v_n)   (5)

where circularly consecutive vertices are adjacent and no vertex appears twice.

To avoid outliers being taken into consideration as wedge-heads, we only look for short cycles. Two concepts of length are possible: the number of edges forming the cycle,

    l_edge(C) = |C| ≤ t_edge,   (6)

and the accumulated distance l_dist along the cycle path, using P_v as the coordinate of a node v, formally defined by

    l_dist(C) = \sum_{i=0}^{|C|−1} |P_{v_i} − P_{v_j}| ≤ t_dist   (7)

with v_i, v_j ∈ C, j = (i + 1) mod |C|, and thresholds t_edge and t_dist. Using l_edge as the cycle length results in a time complexity of O(|E| · t_edge); using l_dist, the complexity is O(|E| · ⌊ t_dist / \min\{\|P_u − P_v\| : (u, v) ∈ E\} ⌋) in the worst case. These concepts are used next to each other in the algorithm.

The cycle extraction proceeds as follows: A depth-first search tree is built and the back-edges are extracted. For each back-edge (u, v) a depth-limited search for node v is conducted with u as root node; paths from v to u represent cycles when joined with {(u, v)}, except the direct path {(v, u)}. At last, the set of cycles is reduced to contain only unique cycles.
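The two cycle-length measures of Equations (6) and (7) are straightforward to compute; the following sketch uses an illustrative dictionary of node coordinates:

```python
import numpy as np

def edge_length(cycle):
    """l_edge(C): number of edges of the cycle (Eq. 6)."""
    return len(cycle)

def dist_length(cycle, coords):
    """l_dist(C): accumulated Euclidean distance along the cycle path (Eq. 7).

    cycle  : ordered list of vertex ids
    coords : dict vertex id -> (x, y) coordinate P_v
    """
    total = 0.0
    for i in range(len(cycle)):
        j = (i + 1) % len(cycle)               # circular successor
        total += np.linalg.norm(np.asarray(coords[cycle[i]]) -
                                np.asarray(coords[cycle[j]]))
    return total
```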
A set of unique cycles contains no two cycles that are equivalent, i.e., that are induced by the same set of graph edges. This is tested by

    C_1 ≡ C_2 \iff E_{C_1} \setminus E_{C_2} = ∅,   (8)

where E_C denotes the edge set of a cycle C. The depth-limited search stops following a search branch when either l_edge or l_dist exceeds its respective threshold, t_edge or t_dist, or when a target node is discovered. It returns a list of paths from the root to the target node.

Triangle Similarity. Once all unique cycles of a skeleton graph have been extracted, their resemblance to a triangle can be analyzed. This can be achieved by comparing the triangle with the smallest error that can be created from the cycle's vertices with the original cycle (Figure 6). With

    A_err^{C,(i_0,i_1,i_2)} = \sum_{j=0}^{2} A_{(c_{i_j}, c_{i_j+1 \bmod |C|}, ..., c_{i_{j+1}})}   (9)

being the sum of error areas between triangle and polygon, we have

    w_loc^contour(C) = 1 − \min_{i_0,i_1,i_2} \frac{A_err^{C,(i_0,i_1,i_2)}}{A_C + A_err^{C,(i_0,i_1,i_2)}}   (10)

for 0 ≤ i_0 < i_1 < i_2 ≤ |C| − 1, C = (c_0, ..., c_n) and n ≥ 2. The advantage of this similarity measure is that the three vertices of the triangle are also vertices of the skeleton graph, thus providing us with feasible starting points for the wedge-arm tracing. Since the cycles form simple polygons, the enclosed areas can be calculated with the shoelace or surveyor's area formula [4].

Figure 6: The triangle with area A_∆ shows the best triangle for the blue polygon. Since the sum of the error areas A_err = A_1 + A_2 + A_3 is almost equal to the triangle area A_∆, the polygon with area A_∆ + A_err is not one of the polygons chosen for wedge-head positions.

Wedge-Head Detection of Solid Shapes. Solid imprints have their centers at junctions v of a skeleton with a particularly long distance from the shape contour. This distance is approximated by the distance of the coordinate P_v of the vertex v to all sites s ∈ S(v), where S is the site set of the underlying Voronoi diagram and

    S(v) = {s ∈ S | P_v is a vertex of V(s)}   (11)

is the set of sites with P_v being a vertex of their Voronoi region V(s). The equation

    d(P_v, s_1) = d(P_v, s_2)  ∀ s_1, s_2 ∈ S(v)   (12)

always holds by definition of the Voronoi diagram. Therefore, any random s ∈ S(v) can be chosen to get a measure for the distance to the contour and use it as a hint for plausible locations of heads of solid wedges (Figure 7). Percentiles of the site-to-vertex distances

    D = {d(P_v, s) | v ∈ S ∧ s ∈ S(v)}   (13)

are used instead of the minimum and maximum to account for outliers. In order to arrive at a range from zero to one with values close to one for long distances d(P_v, s) and close to zero for short distances, the first percentile is used as minimum d_lower, the 99th percentile of these distances as maximum d_upper, and the position within this range is used as certainty value:

    w_loc^solid(v) = { 0 if d(P_v, s) ≤ d_lower;  1 if d(P_v, s) ≥ d_upper;  d*(P_v, s) otherwise }   (14)

    d*(P_v, s) = \frac{d(P_v, s) − d_lower}{d_upper − d_lower}.   (15)

For a junction where n edges meet, \binom{n}{3} wedge-heads are retrieved.

Figure 7: Possible locations of solid wedges are found by looking at skeleton junctions: if their distance to the closest discretization points is above a threshold, it is likely that the center of a wedge impression is located here. Due to the definition of the Voronoi skeleton, the equation d1 = d2 = d3 holds at every skeleton junction.
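The percentile-based certainty of Equations (13)-(15) reduces to a few lines; the function name and the vectorized form are our own:

```python
import numpy as np

def solid_head_certainty(junction_dists):
    """Certainty w_loc^solid for skeleton junctions (sketch of Eqs. 13-15).

    junction_dists : (M,) distance of each junction vertex to one of its Voronoi sites
    """
    d_lower, d_upper = np.percentile(junction_dists, [1, 99])   # robust range
    w = (junction_dists - d_lower) / (d_upper - d_lower)        # Eq. (15)
    return np.clip(w, 0.0, 1.0)                                 # Eq. (14)
```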
After locating the position of a wedge-head, the extents of the impression are calculated. For a wedge-head, multiple wedges are proposed. Wedge vertices {v_i}_{i=1,2,3} must fulfill two conditions:

1. The line segment between the coordinates of a wedge vertex v_i and the tracing start point S may not intersect the shape boundary.

2. A vertex of a wedge must be located within the infinite area between the lines through the coordinates of S and the wedge-head vertices v_j^H and v_k^H, as shown in Figure 8. The condition can be checked by testing whether

    ∠(\overrightarrow{P_S P_{v_i}}, \vec{v}) ≤ \frac{α}{2}   (16)

with

    α = ∠(\overrightarrow{P_{v_j^H} P_S}, \overrightarrow{P_{v_k^H} P_S})   (17)

and \vec{v} being the angle bisector of \overrightarrow{P_{v_j^H} P_S} and \overrightarrow{P_{v_k^H} P_S}.

Figure 8: The pink area shows the place where wedge vertices may be located given the wedge-head of (a) a contour wedge and (b) a solid wedge. The marked junctions are valid as wedge vertex since the straight line to N does not cross the shape boundary. The dotted line in (b) shows that if we had chosen N for solid wedges as for contour wedges, we would not be able to find the correct wedge vertex.

For contour wedges, the tracing start node for a wedge vertex v_i is the respective wedge-head vertex v_i^H; for a solid wedge, the start node is the wedge center. The reason why for the contour vertex the line checked for the conditions above starts at the head vertex instead of the wedge center is that the center in this case is not part of the shape skeleton and is typically located inside a hole in the shape.

From the respective start node the algorithm follows all paths within the area of valid nodes shown in Figure 8. A path may have sections of a certain number of nodes that are inadmissible as wedge vertices. The algorithm returns multiple arms for one direction, resulting in multiple wedge suggestions for a wedge-head. If n_arm1, n_arm2 and n_arm3 are the numbers of arms returned for the respective wedge-head vertices, the number of wedges is n_arm1 · n_arm2 · n_arm3.

3.5. Wedge Set Reduction

The certainty measures for wedge-head locations w_loc can only be used as first hints to possible locations. After wedge-arm tracing, we still get wedge candidates that can be easily identified as improbable by computing the angles between the arms. We want the arms to be evenly spread, so we punish angles that deviate from 120 degrees. Having α1, α2 and α3 as internal wedge angles, we use

    w_angle(α1, α2, α3) = p(α1) · p(α2) · p(α3)   (18)

with

    p(α) = \frac{120}{120 + |120 − α|}   (19)

as measure for the angle quality of wedges. This measure is used in a preliminary reduction step to eliminate all wedges whose angle quality exceeds a given threshold.

The simplest strategy, using a threshold, takes the remaining wedges and removes those that share heads with other wedges having higher angle quality. All other strategies proceed by iteratively testing wedges against the set of chosen wedges and adding them to the result set if no conflicts arise. Wedges are not added if the number of wedge-head edges that are not used by any other wedge as head or arm edge goes below a threshold. Furthermore, contour wedges may not share a head edge with another wedge.
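The angle quality of Equations (18) and (19) is a simple product of per-angle scores; a minimal sketch (angles in degrees) is:

```python
def angle_quality(alpha1, alpha2, alpha3):
    """Angle quality w_angle of a wedge candidate (Eqs. 18-19); 120 degrees is ideal."""
    def p(a):
        return 120.0 / (120.0 + abs(120.0 - a))
    return p(alpha1) * p(alpha2) * p(alpha3)
```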
Balanced Strategies. For documents where the number of solid wedges is greater than or equal to the number of contour wedges, we implemented balanced strategies. There are six different strategies of this kind: balanced-loc, balanced-angle and balanced-size sort the wedges by w_loc, w_angle and size, respectively. Balanced-sides-loc, balanced-sides-angle and balanced-sides-size sort the wedges first by the number of arms that contain at least one edge that is not already used by a chosen wedge. The second kind of balanced strategies recalculates the number of free arms after each iteration. As measure for the size of a wedge, the average length of the lines from the center to the three edges is taken.

Contour-Fill Strategies. Most documents contain more contour wedges than solid wedges. For these documents, the contour-fill strategies have been implemented. The strategies are contour-fill-loc, contour-fill-angle, contour-fill-size, contour-fill-sides-loc, contour-fill-sides-angle and contour-fill-sides-size. They proceed like their respective balanced counterparts but consider the set of contour wedges first before adding solid wedges to the set of chosen wedges. The candidate set of solid wedges is only calculated after the set of chosen contour wedges is computed, and vertices that are incident to a wedge-head edge are excluded.

4. Results

The algorithm has been tested on 94 tracings from [23] and [14]. The ground truth is determined by manually deciding for each cycle and skeleton junction whether it is a valid position of a wedge-head. Since a tracing rarely contains less than 500 wedge marks, two typical tracings have been chosen for the evaluation. They differ in the representation of fractures, in size and in the percentage of solid wedges. As a result we have 1252 annotated cycles and 3792 annotated junctions serving as ground truth.

For the evaluation of the discrimination capabilities of w_loc and w_angle, the Receiver Operating Characteristic (ROC) [9] is used. The ROC shows the quality of a detector by assigning it a point in the ROC space, with the false positive rate (FPR) as x-coordinate and the true positive rate (TPR) as y-coordinate. Therefore, the point assigned to an optimal detector has the coordinates (0, 1). As measure for the overall performance of a discriminator function, the F-score is given [24].

Figure 9: ROC curves of the measures used for the detector of (a) contour wedges and (b) solid wedges.

Contour Wedges. Candidates for contour wedges are found by searching for cycles in the skeleton graph. The similarity of the cycle to a triangle is then used as quality measure for the detector for early rejection of improbable locations for wedge heads. Figure 9a shows that early rejection is reasonable, since the chosen function for the location proves to be a good estimator. However, the green curve shows that the chosen measure for the angles between the arms of a reconstructed wedge is less optimal as a discriminator.

Figure 10a shows the F-score for the contour wedge detector using thresholding only. It shows high scores of about 0.8 to 0.9 for location quality thresholds of about 0.7 to 0.9. The threshold for the angle quality should not be chosen too high, 0.7 at most. The maximum score of 0.90 is achieved for t_loc^contour = 0.79 and the threshold t_angle = 0.45.

Solid Wedges. Candidates for solid wedges are found by searching for skeleton junctions with great distance to the shape contour. Figure 9b shows that weight is a worse discriminator for solid wedges than for contour wedges. The discrimination quality of the angle quality measure for solid wedges looks very similar to the respective curve for contour wedges.

Figure 10b shows the F-scores for the solid wedge detector. For solid wedges, this detector achieves a score of about 0.62 at maximum. In contrast to the F-scores for the contour wedge detector, this score is quite low. The reason for this is that there are a lot more locations to check, since every skeleton junction is considered. Especially when junctions are located next to each other, false hits occur frequently. The best score is achieved for t_loc^solid = 0.68 and t_angle = 0.56.

Figure 10: F-scores for the detector of (a) contour wedges and (b) solid wedges.
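For reference, the quantities reported above can be derived from the raw detection counts as in the following sketch; the standard F1 definition is assumed here, since the text only cites [24] for the F-score:

```python
def detector_scores(tp, fp, fn):
    """Precision, recall and F-score from detection counts (sketch, F1 assumed)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score
```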
4.1. Wedge Set Reduction

The reduction strategies serve to overcome the shortcomings of pure thresholding. We demonstrate their differences with a tracing of the tablet VAT6546 [23]. It represents fractures with lines and shows 215 contour and 120 solid wedges. Figure 11 shows the F-scores for this case. The best F-score is achieved by the contour-fill-sides-loc method with 86% (Figures 12 and 11).

Figure 13 compares the strategies concerning TPR and FPR. It shows a clear ordering between similar methods that differ merely in the measure used for sorting the wedges.

Figure 11: F-scores of reduction strategies for test case VAT6546. The strategies are sorted in descending order by their performance.

Figure 12: The extracted wedge marks of test case VAT6546.

Figure 13: Receiver Operating Characteristic (ROC) space showing the performance of the wedge set reduction strategies for test case VAT6546 (a) for contour wedges, (b) for solid wedges and (c) for all wedges.

5. Summary and Outlook

In this work we presented an algorithm that uses bitmap tracing and skeletonization as intermediate steps to detect wedge impressions in raster graphics of cuneiform documents. We have shown the weaknesses of the measures used to construct an initial wedge set and have shown how conflict set reduction strategies can be used to improve the results significantly.

This work is part of ongoing research on optical character recognition for cuneiform characters [3] and is used as one of many sources of wedge constellations. The presented method will allow us to perform word spotting on raster image databases such as the CDLI. We will also examine whether statistical approaches as in [5] or [8] can be used to enhance the detection results.

References
[1] C. Barber, D. Dobkin, and H. Huhdanpaa. The Quickhull Algorithm for Convex Hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):469–483, 1996.
[2] H. Blum. A Transformation for Extracting New Descriptors of Shape. In Models for the Perception of Speech and Visual Form, pages 362–380. MIT Press, 1967.
[3] B. Bogacz, J. Massa, and H. Mara. Homogenization of 2D & 3D Document Formats for Cuneiform Script Analysis. In Proc. of the 3rd International Workshop on Historical Document Imaging and Processing (HIP15), 2015.
[4] B. Braden. The Surveyor's Area Formula. The College Mathematics Journal, 17(4):326–337, 1986.
[5] M. Cammarosano, G. Müller, D. Fisseler, and F. Weichert. Schriftmetrologie des Keils: Dreidimensionale Analyse von Keileindrücken und Handschriften. Die Welt des Orients, 44(1):2–36, 2014.
[6] R. Cosgriff. Identification of Shape. ASTIA AD 254 792 820-11, Ohio State University Research Foundation, 1960.
[7] E. Di Sciascio, F. Donini, and M. Mongiello. A Knowledge Based System for Content-based Retrieval of Scalable Vector Graphics Documents. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 1040–1044, 2004.
[8] D. Edzard. Keilschrift. In Ia... – Kizzuwatna, volume 5 of Reallexikon der Assyriologie und vorderasiatischen Archäologie, pages 545–567. de Gruyter, 1980.
[9] T. Fawcett. An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[10] D. Fisseler, F. Weichert, G. Müller, and M. Cammarosano. Towards an Interactive and Automated Script Feature Analysis of 3D Scanned Cuneiform Tablets. In The 4th Conference on Scientific Computing and Cultural Heritage (SCCH), pages 1–10, 2013.
[11] D. Fisseler, F. Weichert, G. Müller, and M. Cammarosano. Extending Philological Research with Methods of 3D Computer Graphics Applied to Analysis of Cultural Heritage. In 12th Eurographics Workshop on Graphics and Cultural Heritage (GCH), pages 165–172, 2014.
[12] J. Gravesen. Adaptive Subdivision and the Length and Energy of Bézier Curves. Computational Geometry, 8(1):13–31, 1997.
[13] H. Hameeuw and G. Willems. New Visualization Techniques for Cuneiform Texts and Sealings. Akkadica, 132(2):163–178, 2011.
[14] S. Jakob. Die mittelassyrischen Texte aus Tell Chuēra in Nordost-Syrien, volume 3 of Ausgrabungen in Tell Chuēra in Nordost-Syrien. Harrassowitz, 2009.
[15] J. Kantel, P. Damerow, S. Köhler, and C. Tsouparopoulou. 3D-Scans von Keilschrifttafeln – ein Werkstattbericht. In 26. DV-Treffen der Max-Planck-Institute, pages 41–62. Gesellschaft für wissenschaftliche Datenverarbeitung, 2010.
[16] D. Kirkpatrick. Efficient Computation of Continuous Skeletons. In Proceedings of the 20th Annual IEEE Symposium on Foundations of Computer Science, pages 18–27, 1979.
[17] S. Kumar, D. Snyder, D. Duncan, J. Cohen, and J. Cooper. Digital Preservation of Ancient Cuneiform Tablets Using 3D-Scanning. In Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling, pages 326–333, 2003.
[18] M. Kuntz. Clustering SVG Shapes. In 8th International Conference on Scalable Vector Graphics, 2010.
[19] D. Lee. Medial Axis Transformation of a Planar Shape. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 4(4):363–369, 1982.
[20] H. Mara and S. Krömker. Vectorization of 3D-Characters by Integral Invariant Filtering of High-Resolution Triangular Meshes. In 12th International Conference on Document Analysis and Recognition (ICDAR), pages 62–66, 2013.
[21] H. Mara, S. Krömker, S. Jakob, and B. Breuckmann. GigaMesh and Gilgamesh – 3D Multiscale Integral Invariant Cuneiform Character Extraction. In Proceedings of the 11th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST), 2010.
[22] U. Montanari. Continuous Skeletons from Digitized Images. Journal of the ACM (JACM), 16(4):534–549, 1969.
[23] O. Neugebauer, editor. Register, Glossar, Nachträge, Tafeln, volume 2 of Mathematische Keilschrift-Texte. Springer, 1935.
[24] D. Powers. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011.
[25] F. Preparata. The Medial Axis of a Simple Polygon. In Mathematical Foundations of Computer Science 1977, pages 443–450. Springer, 1977.
[26] W. von Soden. The Ancient Orient: An Introduction to the Study of the Ancient Near East. Wm. B. Eerdmans Publishing Co., 1994.
[27] L. Watkins and D. Snyder. The Digital Hammurabi Project. In Proceedings of Museums and the Web (MW), 2003.
[28] G. Willems, F. Verbiest, W. Moreau, H. Hameeuw, K. Van Lerberghe, and L. Van Gool. Easy and Cost-Effective Cuneiform Digitizing. In The 6th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST), pages 73–80, 2005.
[29] C. Zahn and R. Roskies. Fourier Descriptors for Plane Closed Curves. IEEE Transactions on Computers (TC), 21(3):269–281, 1972.
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

2D tracking of Platynereis dumerilii worms during spawning

Daniel Pucher, Walter G. Kropatsch, Nicole M. Artner
Pattern Recognition and Image Processing (PRIP), Vienna University of Technology, Austria
http://www.prip.tuwien.ac.at

Stephanie Bannister, Kristin Tessmar-Raible
Max F. Perutz Laboratories, University of Vienna, Austria
https://www.mfpl.ac.at/

Abstract. Platynereis dumerilii are marine worms that reproduce by external fertilisation and exhibit particular swimming behaviours during spawning. In this paper we propose a novel worm tracking approach that enables 2D tracking and feature extraction during the spawning process of these worms. The gathered data will be used in the future to characterise and compare male and female spawning behaviours.

Figure 1. Image of a male (red) and female (yellow) worm.

1. Introduction

Platynereis dumerilii are marine polychaete worms (Lophotrochozoa, Annelida, Nereididae), which swim only when sexually mature, in order to reproduce. The timing of reproductive spawning events in this species is synchronized with the moon phase, whereby spawning in nature occurs primarily during new moon. This, together with chemical pheromone signaling, allows mature male and female worms to locate one another and engage in spawning behaviors that constitute the nuptial dance.
See Figure 1 for an image of a male and female worm.

The spawning behaviors of male and female worms are important for successful fertilization of the gametes. The spawning process consists of four general phases: pre-spawning, engaged spawning, gamete release and post-spawning. During pre-spawning, male and female worms typically swim independently of one another, usually with lower speeds, and display a linear body shape. Engaged spawning is initiated when male and female worms come into close contact and sense chemical pheromones secreted into the water by the opposite sex. This is accompanied by a noticeable change in swimming behavior for both sexes: swimming speeds increase (particularly for males), and worms either begin to swim in circles, or swim in tighter circles (particularly for females). Other changes in the plane of swimming are more frequently observed in both sexes during engaged spawning behavior. During gamete release, sperm and eggs are secreted into the water, which, particularly for female worms, results in a dramatic change in body area, length and overall shape. The time individual spawning phases take varies and depends on the worms and their willingness to engage. Some worm pairs are better matches than others, which can result in shorter spawning phases.

Our goal is to analyse these spawning behaviours in a quantitative manner, and to characterise and compare male and female-specific spawning behaviours.

2. Task formulation

The aim is to develop methods that enable the tracking of spawning worms from captured videos and extract features to quantify behaviours. For the tracking, it is important that we distinguish male and female worms in every frame of a captured video, label them and keep track of those labels. This paper focuses on the extraction of features for the analysis of behaviours. The tracking task is simplified by only considering videos with single worms. In order to quantify behaviours, we currently extract the following worm features:

1. Skeleton. The skeleton describes the center line of a worm and is defined by two endpoints and an ordered list of points between them. We use the skeleton to calculate the curvature of a worm and to generate a normalized shape representation.

2. Head position. The head position is an important feature for the calculation of the velocity and the worm trajectory. We define it as an endpoint of the skeleton. The tangent of the skeleton in this endpoint can give us information on the orientation of the worm. To choose the right endpoint, we currently select it at the beginning of a video and keep track of that selection.
3. Velocity. As the swimming speeds increase for both sexes, the velocity is a good indication for the beginning of the engaged spawning.

4. Trajectory of the worm head. The mapping of the swimming trajectories gives us information on the interaction between two worms. Furthermore, for individual worms, the curvature of the trajectory can be compared to the curvature of the worm. A high correlation indicates a circular movement and increases the robustness of the curvature estimation. The trajectory can also give an indication of where we can expect the worm to be in a following frame.

5. Curvature. Measurements of body curvature tell us both about the gross and fine body movements of the worm during the different spawning phases. The gross curvature of the worm's body in general provides information on the directionality of swimming. For example, a mostly straight linear profile would be indicative of linear swimming, while smoothly curved body profiles would indicate circular swimming. Good resolution of finer-scale body curvatures along the length of the worm is also important. For example, a linear profile with several bends could indicate an acceleration of swim speed, or 'wriggling' movements, depending on the amplitude of the curvatures. Such wriggling movements can be seen for males when they are stopping to secrete sperm. Similarly, as gametes are released from the tail, mapping fine-scale curvatures at the tip of the tail could be used to map gamete release events, or characterize sex-specific gamete release behaviours. For example, we have observed fast small tail flicks in males during sperm release, and curling of the tip of the tail in females just prior to egg release. The calculation of the curvature is based on the skeleton of the worm.

6. Normalized shape. To make the comparison of different worms (or of the same worm at different times in a video) easier, we create normalized shape representations. To do this we follow a recent strategy known as co-registration, where shapes are first straightened or flattened to then register different views/deformations of the same normalized shape [1].

7. Length and area. During the gamete release phase the body length and area change, especially for female worms. Therefore, these features are a good indicator for the beginning of this phase.

3. Existing tracking approaches

The tracking of animals and the extraction of features to quantify behaviours is not a new field of application. Caenorhabditis elegans (C. elegans) are roundworms that have been used as model systems in neuroscience for years, and the demand for robust computational methods has led to a number of different tracking systems like Nemo [10], OptoTracker [8] or the tracking system developed by Chatenay and Schafer [2]. These worm trackers are capable of tracking worms and extracting a variety of different features. Unfortunately, they were developed for C. elegans worms, which differ in their appearance as well as their locomotion from Platynereis dumerilii. Furthermore, some of them are only capable of tracking single worms, and others terminate the tracking of animals if they collide and assign new tracks after they separate again. This does not guarantee a continuous trajectory of a single worm for a whole video sequence, which is an important requirement for our behaviour analysis. Other animal tracking projects like AnTracks (www.antracks.org) or "Visual Ants Tracking" by Ying [11] are capable of tracking animals, but do not allow the extraction of features which match our requirements. Therefore, we propose a new system that is capable of tracking Platynereis dumerilii worms and offers feature extraction, including a new method to compute normalized shape forms.

4. Experimental setup

The setup of our worm tracker consists of a light-tight box, a mounted infrared camera and an ordinary PC to capture the videos. The worms are placed inside a spherical bowl we refer to as the arena. Figure 2 shows the arena with two worms.

Figure 2. Image of the arena with two worms taken from a captured video.

The camera takes videos at a size of 1280x960 pixels with 60 frames per second. The infrared camera is important as the spawning in nature occurs at night and we want to reproduce this environment in the lab. The single camera setup has some limitations regarding 3D movements of the worms, as they might conceal parts of their body from the camera's viewpoint, resulting in a flawed representation. Analysis of spawning videos has shown that the worms move horizontally near the water surface. Therefore, we decided to use this single camera setup and neglect the few cases where the gathered data is flawed due to 3D movement. We might, however, change the setup in the future, using three cameras instead of one, to solve the issue with the 3D movement.
5. Segmentation and tracking

Basically, male and female worms can be distinguished by their color and anterior/posterior segment border, which can be seen in Figure 3.

Figure 3. Image of a female (top) and a male (bottom) worm with their segment borders (scale in cm).

The segment border divides a worm into a head and a tail part, and the position of the border is different for male and female worms. Relative to their whole body length, male worms have a longer tail than female worms; therefore, the segment border is closer to the head. Unfortunately, the segment border is not always clearly visible. Figure 4 shows three frames of the same worm in the same video just a few seconds apart. These frames illustrate the problem with the segment border. The worms tend to turn sideways when moving fast, and in such cases the segment border is not visible to the camera. This prevents us from using the segment border as a feature to distinguish male and female worms.

Figure 4. Three different frames of a single worm taken from the same video just a few seconds apart. In the first frame on the left the worm turned sideways, therefore the segment border is not visible.

Furthermore, due to the infrared capture, we do not have color information in the captured videos, and the available grayvalues are not distinctive enough to distinguish male and female worms. Therefore, we choose an approach that does not rely on the shape and color of the worms, but on their continuous motion over time. First, we label the worms at the beginning of a captured video. Then, we calculate the distance between the head positions in consecutive frames and assign the label based on the smaller deviation. This approach already works well for single worms, but it is too simple to track pairs of worms, as they tend to overlap and the distance of head positions alone is not a robust criterion.
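The label-propagation rule just described amounts to the following sketch; the data structures and names are illustrative, not taken from the paper:

```python
import numpy as np

def assign_labels(prev_heads, curr_heads):
    """Propagate worm labels between frames by comparing head positions.

    prev_heads : dict label -> (x, y) head position in the previous frame
    curr_heads : list of (x, y) head positions detected in the current frame
    """
    assignment = {}
    for label, prev in prev_heads.items():
        dists = [np.linalg.norm(np.asarray(prev) - np.asarray(c)) for c in curr_heads]
        assignment[label] = curr_heads[int(np.argmin(dists))]   # smaller deviation wins
    return assignment
```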
In this paper, we focus on the tracking of single worms. We will extend our approach to setups with worm pairs in the future. Although we only track single worms at the moment, it is still possible to analyse separate spawning behaviours in male and female worms, as we add eggs or sperm manually to the arena and the worms react to them. This allows us to analyse isolated spawning behaviours.

To track a single worm we first need to segment it from the background. We do this with a simple background subtraction for every frame of the video. For the subtraction, it is important that there is at least one frame at the beginning of the video with an empty arena, which serves as the background image. As this image serves as the background image for the whole video, it is assumed that the arena does not move during the video.

After the background subtraction the resulting image is converted to a binary image, based on a global threshold. The binary image gives us a collection of regions that correspond to changes in relation to the empty arena. Ideally there is only one region for a single worm. Unfortunately, as the worm produces some noise when moving in the arena (particles or bubbles in the water, reflections on the edge of the arena), we also get some noise in our binary image. Therefore, we only consider regions whose area is above a given threshold as worms. As the regions generated by noise are very small, this approach works very well in our current setup.

6. Feature extraction

Features are extracted for every frame of the captured video and are based on the binary region and/or the skeleton of a worm.

6.1. Skeleton

Given the binary region of the worm, we use morphological thinning to compute the skeleton. In our case this approach is superior to morphological skeletonization with the medial axis transform algorithm, as the latter tends to generate more spurious branches. See Figure 5 for a comparison between the two approaches for a sample worm. The thinning approach also tends to create a smoother skeleton.

Figure 5. Illustration of the worm skeletons (white) computed from the binary segmentation image (outlined by the red line). The left skeleton was computed using morphological thinning, the right one using the skeletonization (MAT) technique.

The skeleton is defined as an 8-connected curve s = ⟨p_1, ..., p_n⟩ where p_i = (x_i, y_i) with i = 1, ..., n. We order the points p_i of the skeleton from head to tail by comparing the endpoints of the skeleton in one frame with the endpoints in the previous frame. The position of the head in the first frame of the video has to be specified by the user.
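The segmentation and thinning pipeline can be sketched with NumPy and scikit-image; the threshold and minimum-area values below are illustrative, not the ones used in the paper:

```python
import numpy as np
from skimage import morphology

def segment_and_skeletonize(frame, background, thresh=30, min_area=200):
    """Background subtraction, global thresholding, noise removal and
    morphological thinning (sketch of Sections 5 and 6.1)."""
    # background subtraction on the grayscale infrared frame
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    binary = diff > thresh                                       # global threshold
    # keep only regions large enough to be a worm (noise removal)
    binary = morphology.remove_small_objects(binary, min_size=min_area)
    # skeleton via morphological thinning (preferred over the medial axis transform)
    skeleton = morphology.thin(binary)
    return binary, skeleton
```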
6.5. Curvature

According to Hermann and Klette [6], the estimation of the curvature along a discrete curve can roughly be divided into three categories: the derivative of the tangent angle, the derivative of the curve and the radius of the osculating circle. We chose a method based on osculating circles as it is fast and the implementation is simple. Gray [5] defines the osculating circle of a curve C at a given point P in the continuous space as the circle that has the same tangent as C at point P as well as the same curvature. We approximate these circles with the circumscribed circles of triangles on the discrete skeleton curve. Casey [3] defined the circumscribed circle as the unique circle that passes through each of the triangle's three vertices.

Given the definition of the skeleton s at the beginning of this section, let k be a neighbourhood size with 1 \le k \le \frac{n}{2} if n is odd and 1 \le k \le \frac{n}{2} - 1 if n is even, where n is the number of points on the skeleton. For each point p_i on s we define a triangle between the three points p_{i-k}, p_i and p_{i+k}. Then the radius of the triangle's circumscribed circle is computed to calculate the curvature at p_i. See Figure 6 for a visualization.

Figure 6. Illustration of the circumscribed circle (blue) for a single point p_i on a skeleton. The circle passes through every vertex of the triangle (red) formed by the points p_{i-k}, p_i and p_{i+k} with k = 10.

The radius of the circumscribed circle is defined as radius = \frac{abc}{4 \cdot area}, where a, b and c correspond to the edge lengths of the triangle and area is the area of the triangle. The area of a triangle is given by area = \left| \frac{1}{2} \cdot determinant \right|, where determinant refers to the determinant of the triangle matrix formed from the three triangle points:

determinant = \begin{vmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{vmatrix}

As the sign of the determinant gives an indication of the orientation of the triangle and therefore an indication of the direction of the curvature, we do not use the absolute value. So we define the area as area = \frac{1}{2} \cdot determinant. With this information the radius is then defined as radius = \frac{abc}{2 \cdot determinant}. The curvature is given by the inverse of the radius, c = \frac{1}{radius}. As we do not take the absolute value of the determinant when calculating the radius, the curvature is a signed value that is positive if the curvature is on the right side and negative if it is on the left side of the skeleton curve. See Figure 7 for a visualization.

Figure 7. Image of a worm with its skeleton (top) and a plot of the estimated curvature of the worm for different k (bottom).

An important factor in the accuracy of this algorithm is the parameter k that defines a neighbourhood around the point of interest on the curve. We tested the accuracy on a discrete circle with a radius of 40 pixels generated with Bresenham's circle algorithm. The results can be seen in Figure 8. The parameter k starts at 0.05 \cdot n, as the error gets too big for smaller values. As k increases, the error gets smaller. The same is true for a constant k but an increasing radius, which corresponds to the multigrid convergence theorem, where we expect the accuracy to increase as the grid resolution (or in our case the circle radius) increases [7, Chapter 10].

Figure 8. Plot of the avg- and max-error for the curvature estimation of a circle with radius 40 and increasing k.

So the accuracy gets better with increasing k. Unfortunately, this accuracy comes at a price, as small curvatures are overlooked if k is too big. Another problem with a fixed neighbourhood k are the points at the beginning and the end of the skeleton curve. For points p_a with a \le k there are no neighbourhood points p_{a-k} defined, as the index would become zero or negative. The same is true for points p_b with b > n - k, where no neighbourhood points p_{b+k} are defined, as the index would get bigger than n. We currently solve this problem by disregarding those points on the curve. In Figure 7 the curvature values therefore always start at index 1 + k and end at index n - k. Another problem is the determination of a good value for the parameter k. In Figure 7 the blue line shows the curvature for k = 17, which equals 0.15 \cdot n and gives the best results on the tested worms.
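The signed circumscribed-circle curvature described above can be sketched as follows; this is an illustrative implementation under the assumption of an ordered (head-to-tail) skeleton, not the authors' code.

import numpy as np

def signed_curvature(points, k):
    # points: (n, 2) array of skeleton points ordered from head to tail.
    # Returns signed curvature for indices k .. n-k-1; boundary points are skipped.
    points = np.asarray(points, dtype=float)
    n = len(points)
    curvatures = []
    for i in range(k, n - k):
        p1, p2, p3 = points[i - k], points[i], points[i + k]
        # Edge lengths of the triangle
        a = np.linalg.norm(p2 - p1)
        b = np.linalg.norm(p3 - p2)
        c = np.linalg.norm(p1 - p3)
        # Determinant of the triangle matrix [[x1 y1 1], [x2 y2 1], [x3 y3 1]]
        det = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p3[0] - p1[0]) * (p2[1] - p1[1])
        if det == 0:
            curvatures.append(0.0)        # collinear points: zero curvature
            continue
        radius = (a * b * c) / (2.0 * det)  # signed radius keeps the orientation
        curvatures.append(1.0 / radius)     # signed curvature c = 1 / radius
    return np.array(curvatures)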
6.6. Normalized shape

We achieve the normalized shape representation of a worm with a backward medial axis transform approach. The starting point is the distance transform of the binary worm image, which labels each pixel with the Euclidean distance to the nearest boundary in the binary image. For every point p_i of the sorted list s of skeleton points, we use the coordinates to look up the distances in the distance transform. Those distances then serve as the radii for the circles. See Figure 9 for a visualization.

Figure 9. Part of the distance transform of a worm with circles drawn for four points on the skeleton.

To get a suitable representation of the worm, the distances between the skeleton points in the video frame need to stay the same on the normalized shape representation. Therefore the Euclidean distance between the points is calculated and taken into account when drawing the circles. Figure 10 shows the results of this method, where in the first visualization only the outlines of a few circles are drawn to show the general idea behind this approach.

Figure 10. Plots of the normalized representation of a worm using only the outlines of 24 circles to visualize the general idea (top) and a complete shape visualization with all 115 filled circles (bottom) for that worm.

6.7. Length

To calculate the length of a worm, we use the geodesic distance of its skeleton plus the radii of the circles at the first and last skeleton point. The circle radii are needed as our skeleton endpoints do not lie at the edge of the worm. The geodesic distance is computed using the Euclidean distance.

6.8. Area

For the area of a worm, we simply calculate the sum of all foreground pixels of the binary image of the worm, which is the zeroth moment.
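A sketch of the length and area features, assuming a binary worm mask and a head-to-tail ordered skeleton; the OpenCV distance transform is used here as one possible way to obtain the endpoint radii described above.

import cv2
import numpy as np

def worm_length_and_area(binary_mask, skeleton_points):
    # binary_mask: uint8 image, worm pixels > 0; skeleton_points: (n, 2) array of (x, y)
    pts = np.asarray(skeleton_points, dtype=float)
    # Geodesic length of the skeleton: sum of Euclidean distances between consecutive points
    geodesic = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    # Distance transform labels each foreground pixel with the Euclidean distance
    # to the nearest boundary; used as circle radii at the skeleton endpoints
    dist = cv2.distanceTransform((binary_mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
    r_head = dist[int(pts[0, 1]), int(pts[0, 0])]    # radius at the first skeleton point
    r_tail = dist[int(pts[-1, 1]), int(pts[-1, 0])]  # radius at the last skeleton point
    length = geodesic + r_head + r_tail
    # Area: number of foreground pixels (zeroth moment of the binary image)
    area = int(np.count_nonzero(binary_mask))
    return length, area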
7. Single worm experiments

Some experiments with single female worm videos were conducted. Figure 12 shows two plots of smoothed worm lengths. For the smoothing, a moving average filter was applied to the original data. The plots show the length of the worms around the time of the gamete release, where the female worms secrete their eggs into the water and get smaller and therefore shorter. This can also be observed in the plots.

Figure 12. Two plots of smoothed worm lengths for two different female worms right around the time of the gamete release (marked in red).

Figure 13 shows how the length of a female worm changes during an entire spawning process. Annotation A marks a special case where the worm is overlapping itself, resulting in a faulty binary area and skeleton. The problem here is the 3D movement of the worm. Another special case where the 3D movement also results in error-prone data is marked with annotation B. Here the end of the tail is not visible to the camera, which makes the worm appear shorter in the video.

Figure 13. Change of worm length over time. During the gamete release the worm gets shorter. Annotation A: Wrong skeleton due to 3D movement of the worm. Annotation B: Wrong length due to 3D movement of the worm.

8. Conclusion and Future work

In this paper we proposed a novel worm tracking approach for Platynereis dumerilii worms that enables both tracking and feature extraction from captured videos. Although our tracking approach is not suitable for tracking two worms in difficult cases, our methods to extract worm features already show promising results.

The method we currently use to track single worms works for two worms if they are physically separated, but as they get close to each other or overlap, the current method might fail. In the future, we will extend the method to consider cases where the worms are close to each other or even overlap. Ideas to achieve this include the comparison of more features than just the head positions of consecutive frames. A combination of all other features could yield an appropriate approach to distinguishing male and female worms.

The current feature extraction is robust in most cases, but there exist special cases where a single worm overlaps itself due to 3D movement in the water. This results in regions and skeletons which do not represent the worm correctly, and therefore the extracted features are flawed as well. One approach will be to look into the watershed method to segment the worms, as it might be superior to the simple threshold based method we use now, especially for worms that overlap.

Our approach to compute the curvature also has some flaws and is not robust enough. In the future we will look into alternative approaches to compute the curvature of discrete curves. Other methods that try to estimate the osculating circles rely on digital straight segment (DSS) recognition [6][4], and Roussillon and Lachaud [9] base their method on maximal digital circular arcs.

9. Acknowledgement

The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC Grant Agreement 337011 to KT-R. We also thank the anonymous reviewers for their valuable input.

References
[1] N. Aigerman, R. Poranne, and Y. Lipman. Lifted bijections for low distortion surface mappings. ACM Trans. Graph., 33(4):69:1-69:12, July 2014. 2
[2] J. B. Arous, Y. Tanizawa, I. Rabinowitch, D. Chatenay, and W. R. Schafer. Automated imaging of neuronal activity in freely behaving caenorhabditis elegans. Journal of Neuroscience Methods, 187(2):229-234, 2010. 3
[3] J. Casey. A sequel to the first six books of the elements of Euclid, containing an easy introduction to modern geometry, with numerous examples. 7th edition, revised and enlarged. Hodges, Dublin. University Press Series, 184 pp., 1895. 5
[4] D. Coeurjolly, S. Miguet, and L. Tougne. Discrete curvature based on osculating circle estimation. In 4th International Workshop on Visual Form 2001, Capri, Italy, Springer Lecture Notes in Computer Science 2059, pages 303-312, 2001. 8
[5] A. Gray. Modern Differential Geometry of Curves and Surfaces with Mathematica. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 1996. 5
[6] S. Hermann and R. Klette. A comparative study on 2d curvature estimators. In Computing: Theory and Applications, 2007. ICCTA '07. International Conference on, pages 584-589, March 2007. 5, 8
[7] R. Klette and A. Rosenfeld. Digital Geometry. Morgan Kaufmann Publishers, 2004. 6
[8] D. Ramot, B. E. Johnson, T. L. Berry, Jr., L. Carnell, and M. B. Goodman. The parallel worm tracker: A platform for measuring average speed and drug-induced paralysis in nematodes. PLoS ONE, 3(5):e2208, 2008. 3
[9] T. Roussillon and J.-O. Lachaud. Accurate curvature estimation along digital contours with maximal digital circular arcs. In 14th International Workshop on Combinatorial Image Analysis (IWCIA), LNCS, pages 43-55. Springer, 2011. 8
[10] G. D. Tsibidis and N. Tavernarakis. Nemo: a computational tool for analyzing nematode locomotion. BMC Neuroscience, 8(1):1-7, 2007. 3
[11] F. Ying. Visual ants tracking. PhD thesis, University of Bristol, 2004. 3

21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3-5, 2016

Significance of Colors in Texture Datasets
Milan Šulc, Jiří Matas
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Center for Machine Perception, Technická 2, 166 27 Praha 6, Czech Republic
{sulcmila,matas}@fel.cvut.cz

Abstract. This paper studies the significance of color in eight publicly available datasets commonly used for texture recognition through the classification results of "pure-color" and "pure-texture" (color-less) descriptors.
The datasets are described using the state-of-the-art color descriptors, Discriminative Color Descriptors (DD) [15] and Color Names (CN) [28]. The descriptors are based on partitioning of the color space into clusters and assigning the image probabilities of belonging to individual clusters. We propose a simple extension of the DD and the CN descriptors, adding the standard deviations of the color cluster probabilities to the descriptor. The extension leads to a significant improvement in recognition rates on all datasets. On all datasets the 22-dimensional improved CNσ descriptor outperforms all original 11-, 25- and 50-dimensional descriptors. Linear combination of the state-of-the-art "pure-texture" classifier with the CNσ classifier improves the results on all datasets.

1. Introduction

Visual recognition based on texture and color are well established computer vision disciplines with several surveys available, e.g. [3, 10, 19, 20, 27, 30]. The state-of-the-art in texture recognition has recently been dominated in terms of accuracy by methods based on deep Convolutional Neural Networks (CNNs) [5, 6], yet the pre-CNN approaches may be preferable in real-time applications for their performance without parallel processing. Although it has been shown that several texture description methods can benefit from adding color information [13], a large number of the pre-CNN texture recognition techniques has been evaluated only on gray-scale images. Since many publicly available datasets used for texture recognition contain color information, we decided to evaluate the accuracy of color-statistics based methods to measure the significance of color information in the datasets.

The first contribution of this paper is a study of the significance of color information in available datasets commonly used for evaluation of texture recognition methods. In total we evaluate 8 texture datasets, namely FMD (Flickr Material Database), ALOT (A Lot Of Textures), KTH-TIPS (Textures under varying Illumination, Pose and Scale), KTH-TIPS2a, KTH-TIPS2b, CUReT (Columbia-Utrecht Reflectance and Texture), VehApp (Vehicle Appearance) and AniTex (Animal Texture).

The second contribution of the paper is an improvement of the state-of-the-art color descriptors, Discriminative Color Descriptors (DD) [15] and Color Names (CN) [28]. DD and CN are based on partitioning of the color space into clusters and assigning each color the probabilities of belonging to individual clusters. Our extension to the DD and the CN descriptors adds the standard deviation for each color cluster to the descriptor. This leads to an improvement in recognition rates on all 8 tested datasets, as shown in the experiments in Section 5.

The third contribution of the paper are experiments combining a state-of-the-art "pure-texture" descriptor with the improved CNσ descriptor, leading to a further increase in recognition accuracy.

The rest of the paper is organized as follows: Sections 2.1 and 2.2 review the state of the art in texture and color recognition, respectively. The selected "pure-color" image descriptors and our extension to them are introduced in Section 3. Publicly available color-image databases commonly used for texture classification are described in Section 4. Section 5 describes the experiments and presents the results. The observations are discussed and conclusions are drawn in Section 6.

2. State of the Art

2.1. Texture-Based Classification
A large number of texture recognition techniques has been proposed, many of them being described in the surveys [3, 19, 20, 30]. In this section we only review the recent developments and the state-of-the-art.

Several recent texture recognition algorithms report excellent results on standard datasets while ignoring the available color information. A number of them is based on the popular Local Binary Patterns, such as the Pairwise Rotation Invariant Co-occurrence Local Binary Pattern of Qi et al. [22] or the Fast Features Invariant to Rotation and Scale of Texture of Sulc and Matas [26]. A cascade of invariants computed by scattering transforms was proposed by Sifre and Mallat [24] in order to construct an affine invariant texture representation. Mao et al. [18] use a bag-of-words model with a dictionary of so called active patches: raw intensity patches that undergo further spatial transformations and adjust themselves to best match the image regions. While the Active Patch Model does not use color information, the authors claim that adding color will further improve the results. Cimpoi et al. [4], using Improved Fisher Vectors (IFV) for texture description, show further improvement when combined with describable texture attributes learned on the Describable Textures Dataset (DTD) and with color attributes.

Recently, Cimpoi et al. [5, 6] pushed the state-of-the-art in texture recognition using a new encoder denoted as FV-CNN-VD, obtained by Fisher Vector pooling of a very deep Convolutional Neural Network (CNN) filter bank of Simonyan and Zisserman [25]. The CNN filter bank operates on (pre-processed) RGB images. The method achieves state-of-the-art accuracy, yet may not be suitable for real-time applications when evaluated without a high-performance GPU.

2.2. Color Statistics for Classification

Color information is processed by many state-of-the-art descriptors in Computer Vision, including the neurocodes of Deep CNNs or different extensions of SIFT incorporating color. Yet we are interested in simpler color statistics, not making use of spatial information.

Standard approaches to collect color information include color histograms (based on different color representations), color moments and moment invariants. Sande et al. [27] provide an extensive evaluation of such descriptors. The Color Names (CN) descriptor by Weijer et al. [28] is based on models learned from real-world data obtained from Google by searching for 11 color names in English. The Color Names have shown to be a successful color attribute for object detection [12] and recognition [14]. The model assigns each pixel the probability of belonging to one of the 11 color clusters. A similar approach is used by the Discriminative Color Descriptor (DD) of Khan et al. [15], where the color values are clustered together based on their discriminative power in a classification problem, with the objective to minimize the drop of mutual information of the final representation.

Khan et al. [13] study the strategies of combining color and texture information. They carried out a comparison of pure color descriptors on the publicly available KTH-TIPS2a, KTH-TIPS2b and FMD datasets, and on another small dataset denoted as Texture-10. Since the results of Color Names and Discriminative Color Descriptors outperformed other color descriptors in texture classification, we will describe the usage of CN and DD in more detail in Section 3 and use the models in our experiments in Section 5.

3. Selected Color Descriptors

Based on the findings of Khan et al. [13] and on our preliminary results, we consider the Color Names [28] and Discriminative Color Descriptors [15] the best match for our experiments for their superior classification accuracy. While each of the approaches creates the color models based on different criteria, the result is a soft assignment of clusters to each RGB value.
In both cases the assignment is performed using a lookup table, which creates a mapping from RGB values to probabilities over C clusters c_i, i.e. p(c_i | x). In this work we use the lookup tables provided by the authors of the methods, i.e. the 11-dimensional Color Names representation by [28] and the universal color 11-, 25- and 50-dimensional representations by [15].

The models assume a uniform prior over the color names p(c_i). The conditional probabilities for each cluster c_i given an image I are computed as an average over all N pixels x_n in the region:

p(c_i \mid I) = \frac{1}{N} \sum_{x_n \in I} p(c_i \mid x_n)    (1)

The standard descriptor D for image I is then a vector containing the probability of each cluster:

D(I) = \big( p(c_1 \mid I), \; p(c_2 \mid I), \; \ldots, \; p(c_C \mid I) \big)^{\top}    (2)

We propose to add another statistic to the color descriptor, the standard deviation of the color cluster probabilities in the image:

\sigma(c_i \mid I) = \sqrt{ \frac{1}{N} \sum_{x_n \in I} \big[ p(c_i \mid x_n) - p(c_i \mid I) \big]^2 }    (3)

We concatenate the standard deviations to the original descriptor to get the extended representation:

D_\sigma(I) = \big( p(c_1 \mid I), \; \ldots, \; p(c_C \mid I), \; \sigma(c_1 \mid I), \; \ldots, \; \sigma(c_C \mid I) \big)^{\top}    (4)
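The following sketch illustrates Equations (1)-(4) in Python. The layout of the lookup table below (one probability vector per RGB value) is an assumption for exposition; the tables released by the CN/DD authors quantize the RGB cube and use their own storage format.

import numpy as np

def color_descriptor(image_rgb, lookup_table, mask=None):
    # lookup_table: array indexed by (r, g, b), giving p(c_i | x) over C clusters
    r, g, b = image_rgb[..., 0], image_rgb[..., 1], image_rgb[..., 2]
    probs = lookup_table[r, g, b]            # per-pixel p(c_i | x_n), shape (H, W, C)
    if mask is not None:
        probs = probs[mask > 0]              # restrict to the region of interest
    else:
        probs = probs.reshape(-1, probs.shape[-1])
    mean = probs.mean(axis=0)                               # Eq. (1): p(c_i | I)
    std = np.sqrt(((probs - mean) ** 2).mean(axis=0))       # Eq. (3): sigma(c_i | I)
    return np.concatenate([mean, std])                      # Eq. (4): D_sigma(I), dimension 2C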
4. Color Texture Datasets

This section reviews publicly available texture datasets that contain color information. Databases available only in a gray-scale version, such as Brodatz, UIUCTex or UMD, are omitted.

4.1. CUReT

The Columbia-Utrecht Reflectance and Texture (CUReT) image database [8], commonly used for texture recognition1, contains 5612 images of 61 classes. There are 92 images per class, with different combinations of view- and illumination-direction. The standard experimental protocol divides the dataset into two halves, using 46 images per class for training and 46 images for testing. Examples of four selected classes from the dataset are displayed in Figure 1.

Figure 1: Examples of four texture classes from the CUReT database: (a) Felt, (b) Polyester, (c) Lettuce leaf, (d) Corn husk.

1 http://www.robots.ox.ac.uk/~vgg/research/texclass/setup.html

4.2. KTH-TIPS

The Textures under varying Illumination, Pose and Scale (KTH-TIPS) database [9, 11] was collected by Fritz, Hayman and Caputo with the aim to supplement the CUReT database, concerning texture variations in real-world conditions. The dataset contains 81 images for each of 10 selected materials, taken with different combinations of pose, illumination and scale. The dataset contains samples of different color for several materials; each of the samples appears several times. In the experimental protocol the dataset is randomly divided into halves: 40 images per class are used for training and the remaining 41 images are used for testing. It is thus probable that each of the samples appears in the training data set.

4.3. KTH-TIPS2

The KTH-TIPS2 database [2, 17], gathered by Mallikarjuna, Targhi, Hayman and Caputo, largely followed the procedure used for the previous KTH-TIPS database, with some differences in scale and illumination. The database also contains images from the previous KTH-TIPS dataset. The objective of the database is to provide a better means of evaluation: it contains 4 physical samples for each of 11 materials, and images of no physical sample are present in both the training and the test set. The database contains 108 images of each physical sample. There are two versions of the database: KTH-TIPS2a and KTH-TIPS2b. In the KTH-TIPS2a dataset, 144 images are missing (namely there are four samples with only 72 images). In the experimental protocol, three samples from each class form the training set and the remaining sample is used for testing. In the case of the KTH-TIPS2b dataset, one sample forms the training set and the remaining three form the test set. Examples from all four samples of four selected classes from the database are displayed in Figure 2.

Figure 2: Examples of four texture classes from the KTH-TIPS2 database: (a) Corduroy, (b) Lettuce, (c) Wood, (d) Wool. Each image belongs to a different physical sample.

4.4. ALOT

The Amsterdam Library of Textures (ALOT) [1] is similar in spirit to the CUReT dataset, yet the number of materials is much higher: it contains 250 texture classes, 100 images per class. The pictures were taken under various viewing and illumination directions and illumination colors. For evaluation, 20 images per class are used for training and the remaining 80 images per class are used for testing. Examples from the ALOT database are displayed in Figure 3.

Figure 3: Examples of four texture classes from the ALOT database: (a) Fruit sprinkles, (b) Pepper (red), (c) Color calibration checker, (d) Macaroni.

4.5. FMD

The Flickr Material database (FMD) was developed by Sharan et al. [23] with the intention of capturing a range of real world appearances of common materials. The dataset contains 1000 images downloaded manually from Flickr.com (under Creative Commons license), belonging to one of the following materials: Fabric, Foliage, Glass, Leather, Metal, Paper, Plastic, Stone, Water or Wood. There are exactly 100 images for each of the 10 material classes. Unlike the datasets described above, FMD was not primarily created for texture recognition, and it includes images of objects with various textures for each material. The dataset also includes binary masks for background segmentation. The standard evaluation protocol divides the images in each class into two halves, 50 images for training and 50 for testing. Examples from the FMD dataset are displayed in Figure 4.

Figure 4: Examples of four texture classes from the FMD database: (a) Fabric, (b) Foliage, (c) Glass, (d) Stone.

4.6. AniTex

The Animal Texture dataset (AniTex) constructed by Mao et al. [18] contains 3120 texture patch images cropped randomly from the torso regions inside the silhouettes of different animals in the PASCAL VOC 2012 database. There are only 5 classes (cat, dog, sheep, cow and horse), 624 images each. The authors created the dataset to explore less homogeneous texture and appearance than available in standard texture datasets.
The patches in the dataset come from images taken under different conditions such as scaling, rotation, viewing angle variations and lighting condition change. For evaluation, the dataset is randomly divided into 2496 training and 624 testing images. Examples from the AniTex dataset are displayed in Figure 5.

Figure 5: Examples of four texture classes from the AniTex database: (a) Cat, (b) Dog, (c) Sheep, (d) Cow.

4.7. VehApp

The Vehicle Appearance dataset (VehApp) was created by the same authors as AniTex [18] with the same intentions. It contains 13723 images cropped from PASCAL VOC images containing vehicles of 6 classes (aeroplane, bicycle, car, bus, motorbike, train). The images are evaluated in a way similar to AniTex: 80% of the images are randomly chosen for the training set, the remaining 20% are used for testing. Examples from the VehApp dataset are displayed in Figure 6.

Figure 6: Examples of four texture classes from the VehApp database: (a) Plane, (b) Bicycle, (c) Bus, (d) Car.

5. Experiments

We compute 8 descriptors for each image in every database: the standard 11-dimensional Color Name descriptor CN and our extended 22-dimensional version CNσ; the 11-, 25- and 50-dimensional Discriminative Color Descriptors DD11, DD25, DD50 and the extended versions DD11σ, DD25σ, DD50σ of double dimensionality.

The multiclass classification is then performed for each descriptor separately by combining binary SVM classifiers in a One-vs-All scheme. Linear SVM classifiers were used together with an approximate feature map of Vedaldi and Zisserman [29]. The χ2 kernel approximation and the histogram intersection kernel approximation were considered; the latter was chosen based on slightly superior performance in preliminary experiments. Platt's probabilistic output [16, 21] was used in order to estimate the posterior class probabilities to choose the result in the One-vs-All scenario. To minimize the effect of the random splits into training and test set, each experiment is performed 10 times on a different split, with the exception of the KTH-TIPS2 databases with 4 experiments based on the material samples.

All 8 color descriptors are compared in terms of class recognition accuracy in Table 1. The best published results of "pure-texture" (color-less) methods and the results of the state-of-the-art FV-CNN [5] method are attached to the table for comparison. The comparison of the best "pure-color" and "pure-texture" results on all 8 datasets is illustrated in Figure 7.
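A minimal scikit-learn sketch of such a One-vs-All linear SVM pipeline is given below. It is only an approximation of the setup described above: AdditiveChi2Sampler provides an explicit feature map for the additive χ2 kernel (a stand-in for the histogram intersection approximation of Vedaldi and Zisserman used in the paper), and sigmoid calibration plays the role of Platt's probabilistic output; the parameter values are placeholders.

import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

def build_color_classifier():
    # Linear SVMs on an explicit additive-kernel feature map,
    # calibrated to return posterior class probabilities
    svm = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
    return make_pipeline(AdditiveChi2Sampler(sample_steps=2),
                         OneVsRestClassifier(svm))

# X_train: rows are CN-sigma (or DD-sigma) descriptors, y_train: class labels
# clf = build_color_classifier().fit(X_train, y_train)
# probabilities = clf.predict_proba(X_test)   # posterior class probabilities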
Table 1: Recognition accuracy of selected color descriptors on publicly available databases commonly used for texture recognition.

              CUReT       TIPS        TIPS2a      TIPS2b      ALOT        FMD         AniTex      VehApp
# classes     61          10          11          11          250         10          5           6
CN            85.9±0.6    99.3±0.9    46.7±2.0    39.0±2.5    51.0±0.5    26.3±2.4    38.0±2.0    34.7±1.0
DD11          68.7±0.9    95.5±1.3    43.5±6.5    36.1±1.0    38.2±0.4    24.0±1.1    32.4±1.6    33.2±1.0
DD25          83.4±0.8    96.8±0.9    44.0±7.6    36.0±2.3    60.9±0.5    23.9±1.4    36.0±1.7    36.9±0.6
DD50          87.7±1.0    99.0±0.7    46.9±4.8    38.5±1.5    65.5±0.4    22.6±1.4    37.4±1.1    39.1±1.0
CNσ           94.2±0.6    99.8±0.3    51.7±5.7    42.6±1.4    73.9±0.5    28.0±2.2    41.7±1.8    39.1±0.7
DD11σ         81.9±0.8    97.6±1.0    48.5±3.8    38.3±1.9    60.1±0.5    22.7±1.6    35.9±2.1    35.8±0.5
DD25σ         88.9±0.7    99.4±0.3    49.1±3.7    39.9±4.5    75.0±0.5    23.9±1.1    39.9±1.6    39.3±0.7
DD50σ         91.0±0.7    99.6±0.2    53.2±4.6    42.0±2.8    78.0±0.5    25.3±1.7    38.9±0.8    41.2±0.9
FV-CNN [5]    99.0±0.2    –           –           81.8±2.5    98.5±0.1    79.8±1.8    –           –
Pure-texture  99.8±0.1[24] 99.7±0.1[4] 88.2±6.7[26] 76.0±2.9[26] 95.9±0.5[26] 57.4±1.7[22] 50.8[18]   63.4[18]

An experiment on combining efficient classifiers of "pure-texture" and "pure-color" was performed as follows: Each image was described using the CNσ color descriptor (using the same method as above) and the Ffirst [26] texture descriptor (with nconc = 3 descriptors per image, each describing c = 7 consecutive scales). An approximate intersection kernel map is applied to both color and texture descriptors, which are then classified using the One-vs-All Support Vector Machines with Platt's probabilistic outputs. The final scores in Table 2 were then combined using 3 axiomatic approaches, denoted as:

1. PROD: The dot product of both of the scores is used for the final decision.
2. SUM: The sum of both of the scores is used for the final decision.
3. SUM0.3: The weighted sum of both of the scores is used for the final decision, where the weight of color is only 30% of the weight of texture, taking into account the lower performance of the color descriptors on most datasets.

In terms of combining probability distributions [7], the SUM and SUM0.3 schemes represent a linear opinion pool and the PROD scheme represents a logarithmic opinion pool.
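As an illustration, the three combination schemes listed above can be sketched as follows; p_texture and p_color are assumed to be the per-class probability scores produced by the two calibrated classifiers for one image.

import numpy as np

def combine_scores(p_texture, p_color, scheme="SUM0.3", color_weight=0.3):
    p_texture = np.asarray(p_texture, dtype=float)
    p_color = np.asarray(p_color, dtype=float)
    if scheme == "PROD":      # logarithmic opinion pool: element-wise product
        scores = p_texture * p_color
    elif scheme == "SUM":     # linear opinion pool: element-wise sum
        scores = p_texture + p_color
    elif scheme == "SUM0.3":  # weighted linear pool, color weighted at 30%
        scores = p_texture + color_weight * p_color
    else:
        raise ValueError("unknown combination scheme: " + scheme)
    return int(np.argmax(scores))  # index of the predicted class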
Figure 7: Comparison of the best published results of "pure-texture" descriptors and the best results obtained using "pure-color" descriptors (recognition accuracy in % on all 8 datasets).

Table 2: Recognition accuracy for combinations of "pure-texture" (Ffirst) and "pure-color" (CNσ) descriptors.

              CUReT        TIPS         TIPS2a       TIPS2b       ALOT         FMD          AniTex       VehApp
# classes     61           10           11           11           250          10           5            6
CNσ           94.24±0.60   99.83±0.31   51.73±5.71   42.64±1.43   73.86±0.46   27.98±2.20   41.67±1.77   39.07±0.67
Ffirst        99.65±0.09   99.51±0.53   88.29±6.77   76.60±4.29   96.43±0.23   50.22±1.90   45.72±1.78   54.41±0.66
PROD          99.41±0.15   99.98±0.08   68.13±5.06   60.12±4.06   94.65±0.20   46.58±2.37   49.97±1.50   56.47±0.76
SUM           99.04±0.20   100.00±0.00  77.59±5.87   60.35±5.13   92.06±0.29   45.70±2.47   50.08±1.56   56.56±0.98
SUM0.3        99.68±0.12   99.85±0.26   88.76±6.40   77.17±4.23   97.05±0.14   52.24±1.68   48.99±1.83   56.62±0.92

6. Observations and Conclusions

A set of experiments with color-based image descriptors was performed on 8 datasets commonly used for texture classification, leading to interesting insights in color-based classification and in the understanding of available texture-recognition datasets.

One can see that using the simple color descriptors is sufficient for excellent results in specific cases, such as the KTH-TIPS dataset, where materials of the same color appear in both training and test data. Satisfying results can also be obtained on the CUReT and ALOT datasets. The KTH-TIPS2a and KTH-TIPS2b datasets are more difficult for "pure-color" classification, since testing data may come from samples of different colors than the training data, as illustrated in Figure 2. The FMD, AniTex and VehApp datasets are quite difficult for their heterogeneous nature, both in terms of texture and color. Yet the color statistics might still provide useful information when combined with other descriptors.

An extension to the Color Names (CN) and Discriminative Color Descriptors (DD) has been proposed (denoted as CNσ, DDσ), significantly improving the recognition accuracy on all 8 tested datasets. The comparison of the Color Names (CN) and Discriminative Color Descriptors (DD) brings a surprising observation: on 6 out of the 8 texture datasets, Color Names outperform even the higher-dimensional Discriminative Color Descriptors DD25, although the opposite may be expected from the findings on different tasks [15]. The improved CNσ outperforms other "pure-color" descriptors on 5 out of 8 datasets; the best results on the remaining 3 datasets are achieved by the improved DD50σ descriptor.

Combining a state-of-the-art "pure-texture" classifier [26] with the "pure-color" classifier of CNσ leads to an improvement on all 8 tested datasets. The weights of the classifiers in the combination should be set according to the classifiers' performance. Note that by combining the classifiers a 100% accuracy was achieved on KTH-TIPS. Significant improvements are also achieved on the AniTex and VehApp databases, where [26] performs rather poorly.

The state-of-the-art "pure-texture" and "pure-color" classifiers and their combinations obtain excellent results on simpler texture-recognition problems.

Acknowledgements

Milan Šulc was supported by CTU student grant SGS15/155/OHK3/2T/13, Jiří Matas by the Czech Science Foundation Project GACR P103/12/G084.

References
[1] G. J. Burghouts and J.-M. Geusebroek. Material-specific adaptation of color invariant features. Pattern Recognition Letters, 30(3):306-313, 2009. 4
[2] B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific material categorisation. In Proc. of IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1597-1604. IEEE, 2005. 3
[3] C.-h. Chen, L.-F. Pau, and P. S.-p. Wang. Handbook of pattern recognition and computer vision. World Scientific, 2010. 1, 2
[4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 3606-3613. IEEE, 2014. 2, 6
[5] M. Cimpoi, S. Maji, I. Kokkinos, and A. Vedaldi. Deep filter banks for texture recognition, description, and segmentation. arXiv preprint arXiv:1507.02620, 2015. 1, 2, 5, 6, 7
[6] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828-3836, 2015. 1, 2
[7] R. T. Clemen and R. L. Winkler. Combining probability distributions from experts in risk analysis. Risk analysis, 19(2):187-203, 1999. 6
[8] K. J. Dana, B. Van Ginneken, S. K. Nayar, and J. J. Koenderink.
Reflectance and texture of real-world surfaces. ACM Transactions on Graphics (TOG), lems. They are outperformed by the recent FV-CNN 18(1):1–34, 1999. 3 model [5] in the more difficult tasks. Yet the low [9] M. Fritz, E. Hayman, B. Caputo, and J.-O. Eklundh. computational complexity of some ”pure-texture” The kth-tips database, 2004. 3 and ”pure-color” descriptors is beneficial and their [10] T. Gevers, A. Gijsenij, J. Van de Weijer, and J.-M. performance may be still interesting for future works, Geusebroek. Color in computer vision: fundamen- e.g. when used in a cascade classification scheme tals and applications, volume 23. John Wiley & and followed by FV-CNN in case of ambiguity. Sons, 2012. 1 [11] E. Hayman, B. Caputo, M. Fritz, and J.-O. Ek- [25] K. Simonyan and A. Zisserman. Very deep convo- lundh. On the significance of real-world conditions lutional networks for large-scale image recognition. for material classification. In Computer Vision– arXiv preprint arXiv:1409.1556, 2014. 2 ECCV 2004, pages 253–266. Springer, 2004. 3 [26] M. Šulc and J. Matas. Fast features invariant to ro- [12] F. S. Khan, R. M. Anwer, J. van de Weijer, A. D. tation and scale of texture. In L. Agapito, M. M. Bagdanov, M. Vanrell, and A. M. Lopez. Color at- Bronstein, and C. Rother, editors, Computer Vision– tributes for object detection. In Computer Vision ECCV 2014 Workshops, Part II, volume 8926 of and Pattern Recognition (CVPR), 2012 IEEE Con- LNCS, pages 47–62, Gewerbestrasse 11, CH-6330 ference on, pages 3306–3313. IEEE, 2012. 2 Cham (ZG), Switzerland, September 2015. Springer [13] F. S. Khan, R. M. Anwer, J. van de Weijer, M. Fels- International Publishing AG. 2, 5, 6, 7 berg, and J. Laaksonen. Compact color–texture de- [27] K. E. Van De Sande, T. Gevers, and C. G. Snoek. scription for texture classification. Pattern Recogni- Evaluating color descriptors for object and scene tion Letters, 51:16–22, 2015. 1, 2 recognition. PAMI, 32(9):1582–1596, 2010. 1, 2 [14] F. S. Khan, J. Van de Weijer, and M. Vanrell. Mod- [28] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Lar- ulating shape features by color attention for object lus. Learning color names for real-world applica- recognition. International Journal of Computer Vi- tions. Image Processing, IEEE Transactions on, sion, 98(1):49–64, 2012. 2 18(7):1512–1523, 2009. 1, 2 [29] A. Vedaldi and A. Zisserman. Efficient additive ker- [15] R. Khan, J. Van de Weijer, F. Shahbaz Khan, nels via explicit feature maps. PAMI, 34(3), 2011. D. Muselet, C. Ducottet, and C. Barat. Discrimina- 5 tive color descriptors. In Computer Vision and Pat- tern Recognition (CVPR), 2013 IEEE Conference [30] J. Zhang and T. Tan. Brief review of invariant texture on, pages 2866–2873. IEEE, 2013. 1, 2, 7 analysis methods. Pattern recognition, 35(3):735– 747, 2002. 1, 2 [16] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on platts probabilistic outputs for support vector ma- chines. Machine learning, 68(3), 2007. 5 [17] P. Mallikarjuna, M. Fritz, A. Targhi, E. Hayman, B. Caputo, and J. Eklundh. The kth-tips and kth- tips2 databases. http://www.nada.kth.se/ cvap/databases/kth-tips, 2006. 3 [18] J. Mao, J. Zhu, and A. L. Yuille. An active patch model for real world texture and appearance clas- sification. In Computer Vision–ECCV 2014, pages 140–155. Springer, 2014. 2, 4, 5, 6 [19] M. Mirmehdi, X. Xie, and J. Suri. Handbook of tex- ture analysis. Imperial College Press, 2009. 1, 2 [20] M. Pietikäinen. Texture recognition. Computer Vi- sion: A Reference Guide, pages 789–793, 2014. 
1, 2 [21] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3), 1999. 5 [22] X. Qi, R. Xiao, C.-G. Li, Y. Qiao, J. Guo, and X. Tang. Pairwise rotation invariant co-occurrence local binary pattern. PAMI, 36(11):2199–2213, 2014. 2, 6 [23] L. Sharan, R. Rosenholtz, and E. Adelson. Mate- rial perception: What can you see in a brief glance? Journal of Vision, 9(8):784–784, 2009. 4 [24] L. Sifre and S. Mallat. Rotation, scaling and de- formation invariant scattering for texture discrimi- nation. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1233– 1240. IEEE, 2013. 2, 6 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A Novel Concept for Smart Camera Image Stitching Majid Banaeyan∗ , Hanna Huber∗ , Walter G. Kropatsch∗ and Raphael Barth+ Vienna University of Technology ∗Pattern Recognition and Image Processing group {majid,hanna,krw}@prip.tuwien.ac.at +Indiecam raphael@indiecam.com Abstract. As panoramic images are widely used further peripheral processing devices. in many applications, efficient image stitching meth- Common image stitching techniques take images ods that provide visually pleasant image mosaics are taken from different views and align them using needed. In this paper we discuss a novel concept for image registration in overlapping regions. So far, all smart camera image stitching based on graph pyra- images are collected and aligned centrally, which mids. For a multi-camera system, the images have suffers from high computational cost. Thus, we aim to be aligned accordingly to create an image mosaic. at parallelizing parts of this process by developping Instead of calculating the corresponding transforma- smart cameras that are able to perform some of the tions centrally, we aim at enabling each camera to in- image transformations themselves. dividually calculate the transformation of the image it takes. Graph pyramids used for image segmenta- The camera systems we consider use fish-eye tion provide information about the segmentation pro- lenses. General camera models such as the pinhole cess. We analyze how this information can be used to model cannot be applied to these lenses, because they calculate the transformations for image alignment. do not conform to the perspective projection due to their large field-of-view. Simple models are given for different projections of ideal fish-eye lenses. They 1. Introduction provide a formula for the radius r which is the dis- Panoramic views form the basis of many applica- tance between an image point and the principal point. tions including augmented reality applications. Pro- The principal point is the point where the optical axis ducing video content with high quality seamless and intersects the image plane. In case of the equidistant artefact-free 360◦ of coverage is challenging per se projection the radius is given by and even more challenging if all related processing, r = f θ, (1) especially seamless stitching, has to work automati- cally and in real-time for live productions. where f is the focal length and θ is the the incident A suitable approach has to solve a system conflict angle of the ray from an object point. However, this between omnidirectional simultaneous video capture formula does not reflect the behavior of real lenses. 
on one hand, which cannot be done from the nodal Instead, extended models are developped which take point due to mechanical collision problems, and into account the high level of distortion. Parameter parallax-free stitching of panoramas without any par- values are estimated using calibration, defining a allax, ghosting and distortion artefacts on the other final model for a particular camera [16]. In fish-eye hand. lenses, both radial as well as tangential distortion is Using high on-board computing power of smart present. While radial distortion reduces the spatial cameras and a dedicated communication network be- resolution towards the periphery of an image and tween cameras could be used to integrate the entire distorts rectilinear objects, geometric shifts are the image processing for automatic real-time stitching result of tangential distortion [15]. into the cameras themselves, avoiding the need for In this paper, we define our concept of smart cam- Additionally, there are some noticeable distortions era image stitching and present ideas how to realize caused by wide-angle optics such as distortion in the it. We will first give an overview of related work border regions of images in fish-eye lenses which in the fields of fish-eye lenses, image stitching and result in additional loss in image resolution. smart cameras in section 2. After declaring our func- tional goal in section 3, we discuss open problems Smart camera networks have a wide range of that we aim to solve in order to realize it and present applications in various areas including surveillance our ideas for possible solutions including novel ap- systems, security monitoring, traffic control and proaches in section 4. Finally, in section 5, we con-telemedicine [1]. For instance, Kawamura et. al [17] clude our paper. proposed a reliable surveillance system for railway stations. Their system tracks suspicious behavior 2. Related Work by applying multiple camera fields of view. Smart sensors communicate with each other over a wire- In this section we present a selection of state-of- less mesh network. Moreover, as an application for the-art techniques in fields that are related to smart the airport, Shirmohammadi et. al [31] introduced camera image stitching. a decentralized target tracking scheme. Smart cam- 2.1. Smart Cameras era nodes automatically identify neighboring sensors with overlapping fields and produce a communica- The name of smart camera goes back to the tion graph which reflects how the nodes will interact middle of 1970s [29] when Ron Schneidermann to fuse measurements in the network. applied it in developing systems for controlling the shutter. Then in 1981 the optical mouse was 2.2. Multi-View Setups and Image Stitching invented by Richard Lyon [24, 25] which was the first realized smart camera including an imaging Moreover, numerous publications deal with device and embedded processing unit as a compact panoramic images. For image registration, feature- system. ”Smart camera is a label which refers based methods which use distinct image points are to cameras that have the ability to not only take generally favored over area-based techniques which pictures but also more importantly make sense of compare images window by window [39]. Lowe et what is happening in the image.” [4, Chapter 2, al. [6, 21, 7] introduced scale-invariant feature points page 21] Smart cameras employ various concepts (SIFT) which have been widely used since. 
They of computer vision and machine vision which can use a 128-dimensional feature vector. Ke and Suk- extract useful information from images resulting thankar [18] adopted this approach, but reduced the in special decisions based on that information. dimension of the descriptor to 36. Alternatively, Bay Smart cameras can be classified into three main et al. [3] presented a faster method based on Haar categories including integrated, compact-system and wavelets using speeded up robust features (SURF). distributed smart cameras [4, chapter 2]. Integrated All these features work well with standard per- smart cameras can be further subdivided into three spective projection since they are invariant to affine types including single-chip [2, 11], embedded [20] transformations and provide a sufficient number of and stand-alone smart cameras. Distributed smart corresponding points to recover the parameters of the cameras involve some sort of networking and have homography. Multi-view images taken from cameras recently attracted significant interest in academic at different positions lead to parallax errors. These and industries fields [28]. Indeed, some problems errors cannot be fully eliminated. Still, these effects such as depth information in foreground detection can be reduced. Global image transformations that and occlusion are difficult to be solved by single are calculated by fitting a homography to matched smart cameras. In this case, using multiple cameras feature points cannot handle parallax well. Zhang with a powerful computing platform is an advantage. and Liu [38] address this problem by combining However, we encounter some physical limitations the transformation using a homography with local of the acquisition hardware. Although current content-preserving warping. The homography is no professional cameras capture images at a horizontal longer chosen as the best fir for all feature point pairs, resolution of about 4k to 5k [27], they are insuffi- but considers only neighboring feature points. Ad- cient for large scales and wide-angle viewpoints. ditionally, they use a tolerant fitting threshold. Per- azzi et al. [27] describe an algorithm for generating level distortion, however, fewer terms are needed videos from unstructured camera arrays. They apply with the division model [10]. Based on this approach, the basic concept of local warping to remove the par- Aleman-Flores et al. [12] formulate a one-parameter allax and define a new error measure with increased model. sensitivity to stitching artifacts. Their method tries to In order to determine lens parameters, various smooth out the blurring, ghosting and some other dis- calibration procedures [37, 19, 32, 36, 34] have tortions caused usually when videos which feed from been developped. In many cases, they extract fea- unstructured camera arrays are combined to create a tures such as lines or corners from the image of a single panoramic video. Deen et al. [9] create image calibration pattern for which the world coordinates mosaics for scientific purposes. Thus, they focus on are known [15]. A self-calibration method based correct rather than visually pleasant results. Parallax on circle-fitting which does not require information errors are reduced by performing pointing correction. about the objects’ world coordinates is presented by Existing tools for panoramic image stitching as Bräuer-Burchardt and Voss [5]. 
However, the distor- well as camera calibration include Hugin 1 and tion of an image needs to exactly fit the chosen distor- PTGui 2, which are both based on Panorama Tools 3 tion model. Aleman-Flores et al. [12] determine the by Dersch. distortion parameter automatically by introducing it into Hough space and detecting distorted lines. German et al.[14] investigate the application of different map projections to panoramic images in- cluding projections of fish-eye lens images. Multi- 3. Our Goal: a 360◦ Image Mosaic view setups are addressed by Sturm et al.[33] who develop a multi-view geometry model for central and We consider a multi-camera system of small high- non-central cameras based on structure-from-motion quality cameras, in order to create a 360◦ image mo- and by Luo et al.[23] who focus on saliency detection saic. The system consists of six fish-eye lens cam- in multi-camera setups. eras. At this point we use the indieGS2K model pro- duced by Indiecam4. Two adjoining cameras share 2.3. Fish-Eye Lenses an overlapping region, respectively. Position and op- tical parameters can be chosen arbitrarily, but will Schwalbe [30] develops a geometric model for be fixed for a specific system. Each camera creates fish-eye lens cameras based on the approximately an image using fish-eye projection. Additionally, it linear relation between the incident angle of the holds the information about the other cameras’ set- ray from an object point and the distance from the tings. In the end, an image mosaic using equirect- corresponding image point to the principal point. angular projection is created. This means that the Distortion is accounted for by using conventional horizon is a straight line in the middle of the image distortion polynomials. Alternatively, Kannala and and vertical lines in real world are vertical lines in Brand [16] present a flexible camera model which the image [14]. Before the actual stitching can be is applicable for fish-eye as well as narrow-angle performed, the respective images have to be trans- lenses. They use a polynomial imaging function as formed accordingly. well as two additional terms for radial and tangential distortion, respectively. The final camera model in- Eventually, our goal is to develop the respective cludes 23 parameters. It provides both a forward as coordinate transformation model. For any two im- well as a backward model. Moreover, Luhmann et ages Ij and Ij+1 from the six cameras with overlap- al. [22] deal with the correction of chromatic aberra- ping view, a function Fj : Ij → Ij+1 has to be found tion in fish-eye images. Standard distortion correc- such that F (pj) = pj+1 for all corresponding pix- tion methods use odd polynomial models as used by els (pj, pj+1) in the overlapping region with pj ∈ Ij, Mallon and Whelan [26]. These models describe the pj+1 ∈ Ij+1. The resulting algorithm should take distorted radius r the image of one camera as well as the settings (posi- d as a polynomial function of the undistorted radius r tion, optics) of the other as input.The output will be u, using only odd terms. For high the accordingly transformed image. 1http://hugin.sourceforge.net/ 2https://www.ptgui.com/ 3http://panotools.sourceforge.net/ 4www.indiecam.com 4. A Novel Concept for Image Alignment consistent image segmentation (SCIS) [8], the infor- mation about the segmentation process is stored. 
As At this point, the following problems have to be this process is performed based on the structure of solved in order to determine the image transforma- the underlying image, it also contains information tion: about the distortion. A target coordinate system is 1. Calibrate the fish-eye lens and determine the defined by the continuous curves of a checkerboard distortion. pattern which follow the isolines of the coordinate system. By applying the segmentation to this pattern 2. Calculate the transformation from the fish-eye and storing the details of the segmentation process, projection to the equirectangular projection. the distortion information of the coordinate system is retrieved. 3. Perform a geometrical classification of possible setups. Considering two cameras C1 and C2, calculate critical points and distances in order to distinguish between the following classes: 4.1.2 Features of the SCIS Algorithm • region in which points can only be seen by The SCIS algorithm segments an image based on Lo- C1 cal Binary Patterns and the Combinatorial Pyramid. • region in which points can only be seen by It works on the local structure of the image and pre- C2 serves structural correctness [8, Chapter 4, page 39] • closer part of the overlapping region with and topology of an image. For this purpose, five visible parallax errors topological classes based on Local Binary Patterns of regions are applied which by combination with • part of the overlapping region with negli- the dual graph are able to remove redundant struc- gible parallax errors tural information. As a result, by using this approach 4. Calculate the coordinate transformation the image graph will be simplified and connected re- gions will be merged without introducing structural In order to solve these problems described in errors [8]. the previous section, we consider the following ap- The SCIS algorithm performs image segmentation proaches. using a graph-based image representation. It pro- 4.1. Lens Calibration and Image Alignment using vides the image at any level of segmentation as well Graph Pyramids as the information about the segmentation process up to that level. The latter contains information about While traditional lens and distortion models have the distortion structure. been studied extensively, we follow a different ap- Initially, each pixel corresponds to a vertex and proach. Our goal is to extract the distortion informa- each edge to a neighborhood relation in the graph, tion using graph pyramids. which represents the base level of the combinatorial pyramid. Subsequently, pixels are merged to regions 4.1.1 Overview which are in turn merged to larger regions based on Traditionally, lens calibration is based on a geometric their intensity values. On higher levels, each ver- model depending on parameters. The respective pa- tex corresponds to an image region. Merging cor- rameter values are determined during the calibration responds to edge contraction and removal. The SCIS procedure. This is a characteristic that previous lens algorithm creates the entire pyramid as well as the calibration methods have in common, even though contraction history. The latter is represented by the different models and procedures have been devel- contraction kernels. Thus, it is able to reconstruct opped. By establishing a model for which the pa- the segmented image at any level. An example of a rameters are specified, these methods already make combinatorial pyramid is shown in Figure 1. 
fundamental assumptions about the structure of the An evaluation study of stereo matching by Joan- distortion. On the contrary, we propose a calibration neum Research [13] shows that the SCIS algorithm method that determines the distortion including its achieves the highest matching quality compared to structure. In a graph pyramid as used for structurally different compression methods. Figure 3. Multi-camera calibration setup for six cameras C1 - C6. Figure 1. Example of a Combinatorial Pyramid. Image taken from [8] mogeneous regions. Since they have all the same value it cannot be said which edges are contracted or which are removed. For making the process more precise we can consider two solutions. One is to apply geometry of target coordinates and perform linear interpolation. How- ever, this approach has the drawback that we do not know the size of the distorted patch, which is partic- ularly problematic in our case where we expect se- vere deformation. The second approach is to shift the checkerboard pattern and create a new image from a different viewpoint. By iteratively applying this pro- cess, the regions inside the patches will be refined. For instance, we can take M captures with different Figure 2. Distorted checkerboard pattern with correspond- offsets. Next, the idea is to freeze only the bound- ing primal graph at the top of the pyramid. Each vertex (yellow) corresponds to a patch. Vertices of the adjacent aries of which we are sure that they are precisely de- patches are connected by an edge (red). lineated. Indeed, by taking two different positions (randomly) and overlapping with the two contraction kernels, both boundaries should be preserved. There- 4.1.3 Calibration Procedure fore, the random space of patches will be smaller and The canonical representation of the combinatorial smaller as the process is used more and more. pyramid stores it as a single array. The elements in There are two ways for applying this strategy. On this array are half-edges, called darts. They are or- the one hand, it can be performed sequentially by dered according to the contraction history. In order freezing the contraction kernels corresponding to the to extract information about the distortion from the boundaries from the previous iteration. On the other combinatorial pyramid, we consider the image of a hand, it can be performed randomly. Given the con- checkerboard pattern, where each patch is assigned traction kernels at every point and knowing the po- an absolute coordinate. At the top level of the pyra- sition of a boundary, we can integrate the contrac- mid, each vertex corresponds to a single patch (see tion kernels using high weights at boundaries and low Figure 2). weights in between. For homogeneous regions, the As a result we get the contraction history. The contraction kernels provided by the shifting approach top level delivers a single vertex for every patch of will converge towards the proper kernel. the checkerboard with its adjacency. All contracted With the contraction kernels provided, the infor- edges of a patch form a spanning tree of the corre- mation about the distortion is stored implicitly, al- sponding region in the primal graph. We do not know lowing us to apply it to any new image. Conve- anything about the contraction kernels inside the ho- niently, the canonical representation stores this in- Figure 5. Calibration pattern using a spherical target coor- dinate system with radius r, azimuthal angle θ and eleva- Figure 4. 
Calibration pattern using a cylindrical target tion angle φ. coordinate system with radius r, azimuthal angle θ and height h. 4.2. Projection remapping formation in an ordered array. Thus, the calibrated The remapping from fish-eye to equirectangular kernels which have to be applied to get to a particu- projection can also be handled by the graph-based lar level of the pyramid can be re-used. calibration method presented in the previous section. The calibration setup for a multi-camera system is For comparison, it can be addressed individually fol- illustrated in Figure 3. lowing German et al. [14]. Information about the camera’s roll, which is the rotation angle about the optical axis, and pitch, which is the elevation angle 4.1.4 Advantages of Calibration using Graph from the horizontal axis, allows the remapping from Pyramids a fish-eye to an equirectangular projection. Roll and pitch can be determined manually or by using hori- Apart from the fact that the graph-based approach zontal or vertical control lines. does not make any assumptions about the structure of the distortion, it yields other advantages compared to previous calibration methods. Accuracy can be in- 4.3. Setup Classification creased simply and reached to the resolution of orig- inal images by increasing the number of shifts. Ad- The classification of the setup with regard to par- ditionally, we do not need a global model of the geo- allax errors can be performed using partial edge con- metric projection for calibration, which is needed for tours as used by Wang et al. [35]. The edge contour many estimation methods of the parameters. Finally, of an obstacle is mapped from one image to the other. our method does not depend on a particular coordi- The parallax is then calculated as the transverse dis- nate system. Instead, any target coordinate system tance between corresponding edge contour pixels. can be chosen. It is defined by the checkerboard pat- tern where the continuous curves correspond to the 4.4. Image Transformation the isolines of a target coordinate system. Thus, var- ious geometries can be used for this approach such Similar to the projection remapping, the image as cylindrical (see Figure 4) or spherical (see Fig- tranformation used for image alignment can be de- ure 5) coordinate systems. In particular, the coordi- termined by the graph-based approach. For compar- nate system of the final mosaic can be chosen as tar- ison, the calibration of the multi-camera system can get coordinate system. In this case, the transforma- be performed using feature extraction and matching. tion provided by the calibration method does not only For this purpose, SIFT features [21] will be used. In consider lens distortion, but also includes remapping order to reduce parallax errors, the image transforma- to equirectangular projection as well as image align- tion will be calculated following the parallax-tolerant ment, and this simultaneously for all six cameras. approach used by Zhang and Liu [38]. 5. Conclusion [11] B. Flinchbaugh. Smart cameras systems technol- ogy roadmap. In B. Kisacanin, V. Pavlovic, and We presented a novel concept for the smart cam- T. Huang, editors, Real-Time Vision for Human- era image stitching. It aims at reducing the cost of the Computer Interaction, pages 285–297, 2005. 2 stitching process by enabling each camera of a multi- [12] M. A. Flores, L. Á. León, L. G. Déniz, and D. E. S. camera system to align the image that takes individ- Cedrés. Automatic Lens Distortion Correction Us- ually. 
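As a rough illustration of such a remapping, the sketch below converts a fish-eye image to an equirectangular panorama while compensating roll and pitch. It assumes an equidistant fish-eye model (r = f·θ) and a particular axis convention; it is not the formulation of German et al. [14], and the function name and parameters (`f`, `roll_deg`, `pitch_deg`, output size) are illustrative.

```python
import cv2
import numpy as np

def fisheye_to_equirect(img, f, roll_deg=0.0, pitch_deg=0.0, out_w=1024, out_h=512):
    """Sketch: remap an equidistant fish-eye image (r = f * theta) to an
    equirectangular panorama, compensating camera roll and pitch by rotating
    the viewing rays before projecting them back into the fish-eye image."""
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0

    # longitude/latitude of every output pixel
    lon = (np.arange(out_w) / out_w - 0.5) * 2.0 * np.pi          # [-pi, pi)
    lat = (0.5 - np.arange(out_h) / out_h) * np.pi                # [pi/2, -pi/2)
    lon, lat = np.meshgrid(lon, lat)

    # unit viewing rays in the panorama frame (z forward, x right, y down)
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    rays = np.stack([x, y, z], axis=-1)

    # undo camera roll (about the optical axis) and pitch (about the x axis)
    roll, pitch = np.deg2rad(roll_deg), np.deg2rad(pitch_deg)
    Rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll),  np.cos(roll), 0],
                   [0, 0, 1]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    rays = rays @ (Rz @ Rx).T

    # equidistant fish-eye projection: angle from the optical axis -> radius
    theta = np.arccos(np.clip(rays[..., 2], -1.0, 1.0))
    phi = np.arctan2(rays[..., 1], rays[..., 0])
    r = f * theta
    map_x = (cx + r * np.cos(phi)).astype(np.float32)
    map_y = (cy + r * np.sin(phi)).astype(np.float32)

    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)
```

Rays that fall outside the lens field of view simply map outside the source image and are filled with the border value.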
Lens calibration can be performed using graph ing One-Parameter Division Models. IPOL Image pyramids, which yields several advantages compared Processing OnLine (Special Issue on Lens Distor- to traditional lens calibration methods. Additionally, tion Models), 4:327–343, 2014. 3 the same method can be used to directly determine [13] B. Froehlich and M. P. Caballo-Perucha. Evaluation the image transformation required for image align- of image compression algorithms version 1.0, issue ment. Currently, the work is in progress, but in near D1, 2015. Joanneum Research. 4 future we are planning to experimentally prove the [14] D. M. German, P. d’Angelo, M. Gross, and B. Pos- applicability of the proposed ideas. tle. New Methods to Project Panoramas for Practi- cal and Aesthetic Purposes. In D. W. Cunningham, References G. Meyer, and L. Neumann, editors, Computational Aesthetics in Graphics, Visualization, and Imaging. [1] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury. The Eurographics Association, 2007. 3, 6 A survey on wireless multimedia sensor networks. [15] C. Hughes, M. Glavin, E. Jones, and P. Denny. Re- Computer Networks, 51:921–960, 2007. 2 view of geometric distortion compensation in fish- [2] L. Albani, P. Chiesa, D. Covi, G. Pedegani, A. Sar- eye cameras. In IET Irish Signals and Systems Con- tori, and M.Vatteroni. VISoc: A smart camera SoC. ference, 208. (ISSC 2008), 2008. 1, 3 In Proceedings of the 28th European Solid-State Cir- cuits Conference, pages 367–370, 2002. 2 [16] J. Kannala and S. S. Brandt. A generic cam- [3] H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded era model and calibration method for conventional, up robust features. In In ECCV, pages 404–417, wide-angle, and fish-eye lenses. IEEE TRANS. 2006. 2 PATTERN ANALYSIS AND MACHINE INTELLI- GENCE, 28:1335–1340, 2006. 1, 3 [4] A. N. Belbachir, editor. Smart Cameras. Springer, 2010. 2 [17] A. Kawamura, Y. Yoshimitsu, K. Kajitani, T. Naito, [5] C. Bräuer-Burchardt and K. Voss. A new algo- K. Fujimura, and S. Kamijo. Smart camera network rithm to correct fish-eye- and strong wide-angle- system for use in railway stations. In SMC, pages lens-distortion from single images. Proceedings 85–90. IEEE, 2011. 2 2001 International Conference on Image Processing [18] Y. Ke and R. Sukthankar. PCA-SIFT: A more dis- 2001, Vol.1, pp.225-228, 2001. 3 tinctive representation for local image descriptors. [6] M. Brown and D. G. Lowe. Recognising panoramas. In Proceedings of the 2004 IEEE Computer Society In Proceedings of the Ninth IEEE International Con- Conference on Computer Vision and Pattern Recog- ference on Computer Vision - Volume 2, ICCV ’03, nition, CVPR’04, pages 506–513, Washington, DC, pages 1218–1225, Washington, DC, USA, 2003. USA, 2004. IEEE Computer Society. 2 IEEE Computer Society. 2 [19] M. Kedzierski and A. Fryskowska. Precise method [7] M. Brown and D. G. Lowe. Automatic panoramic of fisheye lens calibration. In Proceedings of image stitching using invariant features. Int. J. Com- the ISPRS-Congress, pages 765–768. International put. Vision, 74(1):59–73, Aug. 2007. 2 Society for Photogrammetry and Remote Sensing, [8] M. Cerman. Structurally correct image segmenta- 2008. 3 tion using local binary patterns and the combinato- [20] B. Kisacanin, S. Bhattacharyya, and S. Chai. Em- rial pyramid, 2015. Wien, Techn. Univ., Dipl.-Arb., bedded Computer Vision. Springer, 2007. 2 2015, Technical Report 133. 4 [21] D. G. Lowe. Distinctive image features from [9] B. Deen. In-Situ Mosaic Production at JPL/MIPL. 
scale-invariant keypoints. Int. J. Comput. Vision, Pasadena, CA : Jet Propulsion Laboratory, National 60(2):91–110, Nov. 2004. 2, 6 Aeronautics and Space Administration, 2012. Plan- etary Data: A Workshop for Users and Software De- [22] T. Luhmann, H. Hastedt, and W. Tecklenburg. Mod- velopers 2012, JPL TRS 1992+. 3 elling of chromatic aberration for high precision [10] A. Fitzgibbon. Simultaneous linear estimation of photogrammetry. Remote Sensing and Spatial In- multiple view geometry and lens distortion. Pro- formation Sciences, 36 (Part 5):173–178. 3 ceedings of the 2001 IEEE Computer Society Con- [23] Y. Luo, M. Jiang, Y. Wong, and Q. Zhao. Multi- ference on Computer Vision and Pattern Recognition camera saliency. IEEE Trans. Pattern Anal. Mach. 2001, Vol.1, pp.I-I. 3 Intell., 37(10):2057–2070, 2015. 3 [24] R. Lyon. The optical mouse, and architectural and F. Zhang. A practical distortion correcting methodology for smart digital sensors. In H.T.Kung, method from fisheye image to perspective projection B.Sproull, and G.Steele, editors, Computer Science image. In Information and Automation, 2015 IEEE Press. Invited Paper, CMU Conference on VLSI International Conference on, pages 1178 – 1183, structures and Computations, 1981. 2 2015. 3 [25] R. Lyon. Apparatus for controlling movement of a [37] X. Ying, Z. Hu, and H. Zha. Fisheye lenses cali- curser in computer display system, 1983. European bration using straight-line spherical perspective pro- Patent. 2 jection constraint. In P. J. Narayanan, S. K. Nayar, [26] J. Mallon and P. Whelan. Precise radial un-distortion and H.-Y. Shum, editors, ACCV (2), volume 3852 of images. Proceedings of the 17th International of Lecture Notes in Computer Science, pages 61–70. Conference on Pattern Recognition, 2004, Vol.1, Springer, 2006. 3 pp.18-21. 3 [38] F. Zhang and F. Liu. Parallax-tolerant image stitch- [27] F. Perazzi, A. Sorkine-Hornung, H. Zimmer, ing. In Proceedings of the 2014 IEEE Confer- P. Kaufmann, O. Wang, S. Watson, and M. Gross. ence on Computer Vision and Pattern Recognition, Panoramic video from unstructured camera arrays. CVPR ’14, pages 3262–3269, Washington, DC, In Proc. Eurographics 2015, volume 34, 2015. 2, 3 USA, 2014. IEEE Computer Society. 2, 6 [28] B. Rinner and W. Wolf. An introduction to dis- [39] B. Zitov and J. Flusser. Image registration meth- tributed smart cameras. In Proceedings of the IEEE, ods: a survey. Image and Vision Computing, 2003, volume 96, pages 1565–1575, 2008. 2 Vol.21(11), pp.977-1000. 2 [29] R. Schneidermann. Smart cameras clicking with electronic functions. Electronics, 48:74–81, 1975. 2 [30] E. Schwalbe. Geometric modelling and calibra- tion of fisheye lens camera systems. In Proceed- ings 2nd Panoramic Photogrammetry Workshop, Int. Archives of Photogrammetry and Remote Sensing, pages 5–8, 2005. 3 [31] B. Shirmohammadi and C. J. Taylor. Distributed tar- get tracking using self localizing smart camera net- works. In Proceedings of the Fourth ACM/IEEE In- ternational Conference on Distributed Smart Cam- eras, pages 17–24, New York, NY, USA, 2010. 2 [32] P. Srestasathiern and N. Soontranon. A novel cam- era calibration method for fish-eye lenses using line features. ISPRS - International Archives of the Pho- togrammetry, Remote Sensing and Spatial Informa- tion Sciences, pages 327–332, Aug. 2014. 3 [33] P. Sturm, S. Ramalingam, and S. Lodha. On cal- ibration, structure-from-motion and multi-view ge- ometry for general camera models. In R. Reulke and U. 
Knauer, editors, 2nd ISPRS Panoramic Pho- togrammetry Workshop,, Berlin, Allemagne, 2005. ISPRS. Published in the Int. Archives of Photogram- metry, Remote Sensing and Spatial Information Sci- ences, Vol. XXXVI-5/W8. 3 [34] S. Urban, J. Leitloff, and S. Hinz. Improved wide- angle, fisheye and omnidirectional camera calibra- tion. {ISPRS} Journal of Photogrammetry and Re- mote Sensing, 108:72 – 79, 2015. 3 [35] X.-H. Wang, W.-P. Fu, and W. Chen. Detection of obstacle based on nocular vision. 2010 International Conference on Intelligent Computation Technology and Automation, May 2010, Vol.2, pp.71-74. 6 [36] Z. Wang, H. Liang, X. W. andYipeng Zhao, B. Cai, C. Tao, Z. Zhang, Y. Wang, S. Li, F. Huang, S. Fu, 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A concept for shape representation with linked local coordinate systems Manuela Kaindl and Walter G. Kropatsch Pattern Recognition and Image Processing Group, Vienna University of Technology, Austria http://www.prip.tuwien.ac.at Abstract. object’s element can be described as a transformation This paper discusses a concept for the repre- between two linked coordinate systems. Swinging of sentation of n-dimensional shapes by means of a the arm can be characterised as a transformation of model, based on linked local coordinate systems. the arms coordinate system in respect to the linked Through application of the medial axis transform coordinate system of the torso for movement of the (MAT) and decomposition of the resulting medial shoulder and transformation of the distal part’s coor- axis (MA), articulated, as well as non-rigid abstract dinate system in respect to the system of the upper n-dimensional bodies can be described by defining arm for movement of the elbow. The coordinate sys- corresponding local coordinate systems for each ele- tem of the hand in respect to the system of the fore- ment. This should allow a distinct and invariant rep- arm does not change in that case (Fig. 1). In case of a resentation of every point of the shape, which can be smooth deformation, local interpolation between the used for complex composite transformations of the transition of the elements may be needed. object in the context of robotic manipulation. 1. Introduction For the automatic manipulation of objects and rea- soning considering their attributes, a powerful model is needed. Articulated objects, like the human body, or deformable objects, like a piece of clothing, de- mand a model that is able to represent complex in- trinsic transformations. These classes of objects can be represented by defining coordinate systems for each segment, so every point of the object is dis- tinctly determined by a set of coordinates. One appli- cation, for both classes of objects mentioned, is auto- mated dressing-assistance for a person. Linked local coordinate systems should allow the description of every point of the shape, so it can be exactly defined where a robotic arm needs to grasp a glove and how it needs to place it for the person to slip in comfortably, considering the person’s range of motion. A coordinate system is specified by its origin, de- termining the location, and a set of basis vectors, Figure 1. Linked local coordinate systems of a swinging defining the orientation and scale of the element. arm. Frames indicating the area of a coordinate system. It makes the description of an element invariant to Forearm and hand do not move in respect to each other changes. 
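The idea of describing a point once in its element's frame and obtaining its global position by composing the linked transforms can be sketched in a few lines. The 2D chain below (torso, upper arm, forearm, hand) is a hypothetical example with made-up link lengths and joint angles; it is not taken from the paper.

```python
import numpy as np

def pose(angle, tx, ty):
    """2D rigid transform of a child frame expressed in its parent frame."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1.0]])

# kinematic chain: torso -> upper arm -> forearm -> hand
# link lengths and angles are arbitrary illustrative values
chain = {
    "upper_arm": ("torso",     pose(np.deg2rad(-30), 0.0, 0.40)),   # shoulder
    "forearm":   ("upper_arm", pose(np.deg2rad(45),  0.0, -0.30)),  # elbow
    "hand":      ("forearm",   pose(0.0,             0.0, -0.25)),  # wrist (fixed)
}

def to_world(frame, chain):
    """Compose the local transforms up the chain to map frame coordinates to the torso frame."""
    T = np.eye(3)
    while frame != "torso":
        parent, local = chain[frame]
        T = local @ T
        frame = parent
    return T

# a fingertip described once, in hand-local coordinates
fingertip_local = np.array([0.0, -0.10, 1.0])
print(to_world("hand", chain) @ fingertip_local)

# swinging the arm: only the shoulder transform changes, while the fingertip's
# hand-local coordinates stay the same
chain["upper_arm"] = ("torso", pose(np.deg2rad(20), 0.0, 0.40))
print(to_world("hand", chain) @ fingertip_local)
```

Swinging the arm changes only the shoulder transform; the fingertip keeps its hand-local coordinates, exactly the invariance the linked systems are meant to provide.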
In the case of articulated movements, the while the linked system of the distal part (parent of hand specific coordinates of the parts do not need to be and forearm) of the arm changes in respect to the system changed. The intrinsic movement of an articulated of the upper arm. The intrinsic movement of a non-rigid object is Handling and predicting articulated objects or supported by the model’s invariance to deformation non-rigid objects demands a complex model that can originating from the axial representation. The ob- represent the vast amount of different possible ap- ject’s axial representation provides the linked local pearances of an object. Several projects have already coordinate systems. In 3D space, axial representa- been dedicated to that issue. Li, Chen and Allen [11] tions can be produced by sweeping spheres along the used meshes of deformable objects to simulate the axis [16]. For 2D objects, geometric primitives, like movement and its results to identify grasping points circles or line segments, can be used as generators of garments. With a system of dictionary learning [14, 20]. The linked local coordinate systems are via spatial pyramid matching and sparse coding, a based on the resulting medial axis of the object using robotic grasper is enabled to grasp, flatten and fold an end point as the origin and a branch of the medial garments. Felzenszwalb, Mc Allester and Ramanan axis as a basis vector of the coordinate system. [5] published an algorithm for the recognition of de- Several problems need to be addressed to provide formable objects in images by means of a discrimina- a stable and invariant model that can represent an ob- tively trained, multiscale, deformable part model in ject and leads to reliable reasoning: 2008. Godec, Roth and Bischof [7] described hough- based tracking of non-rigid objects in 2013. Their 1. Noise approach utilises the generalised Hough-transform to handle articulated and non-rigid objects. Pouch et al. • Noise inside the shape creating holes. [13] resort to the MAT to segment the deformable • Noise along the boundary creating spuri- aortic valve apparatus in 3D echocardiographic im- ous branches. ages. To provide a stable basis for the concept, a MAT 2. Decomposition algorithm must be used which can provide a geomet- • Multiple affiliation of points in branching rically accurate and compact MA. In recent years, areas. several groups have been dedicated to improve prior efforts in that field. Li et al. published an approach 3. Preservation of structure for MAT by Quadratic Error Minimization to com- pute a stable and compact MA [10] The groups of • Ordering of axes at branching points. Zhu et al. published a paper on the constructive gen- eration of the medial axis for solid models [18] and 4. Special shapes also an approach for calculation of the medial axis • Spheres and objects based on spheres. of a CAD model by parallel computation [19]. Aich- holzer, Aigner, Aurenhammer and Juettler showed a • Circular MAs. technique for the MAT by means of a polyhedral unit The novelty of the method is the utilisation of ball instead of the standard Euclidean unit ball [2] linked local coordinate systems for the representation of n-dimensional objects for robotic manipulation. 3. Method MAT has the property of producing a MA of one The paper is organised as follows. In section 2, dimension less than the object in many cases. A related work is outlined. 
Section 3 describes the pro- 3D object creates a 2D MA and a MAT of a 2D posed method and its open problems in detail. Sec- MA generates a 1D manifold (Fig. 2.a) that can tion 4 concludes the paper with a discussion of the be decomposed at its branching points (Fig. 2.b, method. Fig. 2.c). As branching points we denote loca- tions where more than 2 branches of the MA meet. 2. Related Work These points represent the basis of convexities of Most recently, research in the field of robotic the shape. Points within the largest inscribing circle dressing-assistance was done by Gao, Chang and around these branching points, the branching area, Demiris, who utilise randomized forests for a model have an unclear affiliation to a MA branch, which of the upper body [6]. Klee et al. used a skeleton poses a problem when the connected MA branches tracker for a robotic dressing-application. move in respect to each other. The decomposed a b a b c Figure. 3 a) Elongated shape with MA and its Figure. 2. a) MAT of the image of a hand. Largest straightened representation (b). The representation is inscribing circles form the MA. b) Decomposed branch invariant to deformation. of the MA. c) Area to be described in respect to this branch. of MAT, decomposition and straightening creates a branch of the MA is straightened to form the x-axis graph with end points and branching points as nodes of a new coordinate system by replacing the geodesic and axis branches as edges (Fig. 5). distances by Euclidean coordinates. The distances along the MA stay identical, while the curvature is a b removed (Fig. 3). Figure 4. Coordinate system based on a MA branch. A point is defined by longitude and latitude. This makes the representation invariant to defor- Figure. 5 a) MAT of the image of a hand. b) Graph mation of the object, except stretching and compres- created by straightening the MA branches of the hand. sion, where the geodesic distance may change with movement. One end point of the axis can be cho- By means of the graph, the structure of the ob- sen as the origin. All points within the silhouette of ject can be identified. The graph concept is based on the object can be described as a tuple of longitude the notion of cellular complexes, described by Ko- along the axis and latitude as the distance of the point valevsky [9], which states that an n-dimensional ob- along the normal to the axis (Fig. 4). This procedure ject is confined by an (n-1)-dimensional object. The 1D MA is confined by 0D points, the 2D MA is con- ordinates of the MA are replaced by the geodesic co- fined by 1D curves and so forth. Based on this prin- ordinates the axis shall have within the shape. Fig. 8 ciple of cellular complexes and the attribute of MAT shows the 2D object that emerges from the composed to produce a MA of the objects dimensionality mi- MA. This 2D object itself can be used as the MA for nus 1 in many cases, it is assumed that the proposed a 3D object. This concept can be be continued due method holds for many n-dimensional objects by re- to the MAT’s attribute to create an object with the di- cursive application until 1-dimensionality is reached. mensionality of the object minus 1. So its reversal To communicate the principle of MA, we show leads to an object with the dimensionality of the MA how to build an abstract object from its MA. A shape plus 1. can be created by sweeping a circle along a 1D Axis as can be seen in Fig. 6. The MA is synonymous Figure 8. Covered area as 2D MA of a 3D object. Figure 6. 
Circles swept along a 1D MA. Transparency in- dicates the sweeping movement. 3.1. Noise Noise on the boundary of the shape can cause spu- with the x axis of a coordinate system we use to de- rious branches. Noise within the shape may cause fine all points of the shape. The radius of every circle holes, which can lead to circular MAs. Several at position x along the axis has to be stored to cre- projects are dedicated to the reduction of the influ- ate the intended object. This assures the preservation ence of noise on the MAT. Most recently Spitzner of shape. Given that the circles have to touch the and Gonzalez [17] published a method called Shape outline of the shape at at least 2 points at all times Peeling to improve the stability of image skeletons. and no circle is completely contained in another, the Abiva and Larsson [1] proposed a method to utilise silhouette of all the circles combined describes the the Scale Axis Transform to prune the MA of spuri- shape that is to be produced [3]. Noise on the bound- ous branches. Montero and Lang [12] published an ary of the object can cause spurious branches, mean- algorithm for skeleton pruning by means of contour ing branches of the MA that do not hold valuable in- approximation and the integer MAT in 2012. formation about the appearance of the shape. Noise within the object may cause holes and therefore cir- 3.2. Decomposition cular MAs. In Fig.7, we compose several branches to Decomposition is performed in branching areas to obtain less complex axes. Serino, Arcelli and Sanniti di Baja [15] recently described the decomposition of 3D objects at branching points to obtain meaningful object parts. In 2D, the branching area lies within the largest inscribing circle where 2 or more branches of the axis meet in the centre (Fig. 9). While the Figure 7. Circles swept along a composed 1D MA. Trans- points of the shape lying in a circle that only belongs parency indicates the sweeping movement. to one axis, are uniquely defined, points within the branching area can be described in relation to sev- one MA. The constellation of branches determines eral branches of the axis (Fig. 10). If branches move the structure of the object. The structure can have in respect to each other, these points shall each be different constraints in its movement, depending on affiliated with only one branch to preserve a unique the intrinsic mobility of the object. This topic is dis- representation. While Serino, Arcelli and Sanniti di cussed further in chapter 3.3 Preservation of struc- Baja [15] can already demonstrate impressive experi- ture. When creating the 2D object, the Euclidean co- mental results of the decomposition of the composed Figure 9. A point within a branching area can be described in relation to several branches of the axis. Figure 12. A sphere swept along the branching curve cre- ating a new 3D object based on a sphere. Figure 10. A point within a branching area can be de- The branching area itself can be seen as a 3D rod- scribed in relation to several branches of the axis. Axis like object or as a 4D object created by sweeping a a is extended across the centre, illustrating its negative do- 3D sphere along an axis. This implies a leap of at main. least 2 dimensions to reach the 1D MA, which vi- olates the assumption that the MAT reduces the di- 1D MA of 3D objects, MAs of higher dimensions mensionality of an object by 1. A problem that is yet require further research. 
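A minimal sketch of the longitude/latitude description, assuming the decomposed MA branch is available as a 2D polyline: the longitude is the arc-length position of the closest point on the branch and the latitude is the signed offset along the local normal. The function name and the toy branch are illustrative, and the treatment of branching areas discussed above is ignored here.

```python
import numpy as np

def branch_coordinates(branch, point):
    """Longitude/latitude of a 2D point with respect to a polyline MA branch:
    longitude is the geodesic (arc-length) position of the closest point on
    the branch, latitude the signed distance along the branch normal there."""
    branch = np.asarray(branch, dtype=float)
    point = np.asarray(point, dtype=float)

    best = (np.inf, 0.0, 0.0)                      # (|lat|, longitude, latitude)
    arclen = 0.0
    for p, q in zip(branch[:-1], branch[1:]):
        seg = q - p
        seg_len = np.linalg.norm(seg)
        t = np.clip(np.dot(point - p, seg) / seg_len**2, 0.0, 1.0)
        foot = p + t * seg                         # closest point on this segment
        offset = point - foot
        dist = np.linalg.norm(offset)
        if dist < best[0]:
            normal = np.array([-seg[1], seg[0]]) / seg_len
            best = (dist, arclen + t * seg_len, float(np.dot(offset, normal)))
        arclen += seg_len
    return best[1], best[2]                        # (longitude, latitude)

# a gently curved branch and a point lying to its left
branch = [(0, 0), (1, 0.2), (2, 0.3), (3, 0.3)]
print(branch_coordinates(branch, (1.5, 1.0)))
```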
to be solved and is explained further in the chapter In 3D, there can be branching points or branching 3.4.1 Spheres and objects based on spheres. curves where the branches of the MA meet as can A different approach is to apply the MAT recur- be seen in Fig. 11. In a first idea we approach the sively to every branch of the MA until 1D is reached. branching area as if it is an object itself. The branch- This way, joints will not necessarily imply a con- ing area of a curve we define by the largest inscribing nection of the MA branches (Fig. 13) and the MA sphere that is swept along the branching curve (Fig. branches of an object might not intersect. If the MA 12). breaks into several pieces, it arises the question of how the structure can be maintained. Further work on this matter is required. Figure 11. Two 2D MA branches of a 3D object forming Figure 13. 1D MA branches of the 2D MA branches do a branching curve where they intersect. not intersect. 3.3. Preservation of structure a b Articulated objects with a specific range of motion require constraints at joints, so the human forearm can not rotate around the elbow, but can only flex in one direction to a certain degree. Non-rigid objects, like cloth, require different constraints since they do not have joints, but feature a certain thickness, stiff- c d ness, weight and other properties. A basic ordering has to be maintained regardless of these characteris- tics. As shown in Fig. 14, all MA branches might be Figure 15. a) 3D sphere producing a 0D MA. b) Equator applied to a sphere to provide orientation for the spherical coordinate system. c,d) Shape described by sweeping a spherical coordinate system along a path. Figure 14. 3D branching point of MA branches. Branch a ates a 0D MA. Fig. 15.b illustrates the sphere after can move freely except across the triangles spanned by the other branches b, c and d. application of an equator to orient the spherical co- ordinate system. With these systems, all points of a able to move freely, provided they do not cross planes sphere can be distinctly determined. Objects based spanned by two different axes to sustain the objects on spheres imply that the shape can be created by organisation. The structure can be preserved by con- moving a sphere along a path (Fig. 15.c, Fig. 15.d). sidering the branches of the MA as edges and the end It follows, therefore, that every point of the object points and branching points as nodes of a combina- based on a sphere can be uniquely determined when torial map as described by Damiand and Lienhardt the spherical coordinate system is moved along the [4]. MA. 3.4. Special shapes 3.4.2 Circular medial axes There are several open problems regarding special shapes in the method that require further research. Circular medial axes occur when an object element Thoughts of the community on the matter are highly has genus higher than 0 (Fig. 16.a) and at concavities appreciated. of the object (Fig. 16.b). If a circular MA branch is connected to 1 or more other branches of the MA, the branching points can be used to decompose the cir- 3.4.1 Spheres and objects based on spheres cular MA and therefore create non-circular sections The concept of MAT is mostly built on the usage of that can be treated regularly. This is the case if the circles and spheres. If an object, or a part of it, itself object features a tail. 
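For the sphere-based special case, a local spherical coordinate system only becomes unique once a pole (and hence an equator) and a prime-meridian direction are fixed, as described above. The following sketch converts points to such coordinates relative to a given centre; the function name and the arbitrarily chosen pole and reference direction are assumptions made purely for illustration.

```python
import numpy as np

def spherical_coords(points, center, pole, prime_dir):
    """Express points in a local spherical coordinate system: radius, elevation
    measured from the chosen equator, and azimuth measured from a reference
    (prime-meridian) direction; the pole/equator choice orients the system."""
    pole = pole / np.linalg.norm(pole)
    prime = prime_dir - np.dot(prime_dir, pole) * pole     # project into the equator plane
    prime /= np.linalg.norm(prime)
    east = np.cross(pole, prime)

    v = np.atleast_2d(points) - center
    r = np.linalg.norm(v, axis=1)
    elevation = np.arcsin(np.clip(v @ pole / r, -1.0, 1.0))
    azimuth = np.arctan2(v @ east, v @ prime)
    return r, azimuth, elevation

r, az, el = spherical_coords([(1.0, 1.0, 1.0)], center=np.zeros(3),
                             pole=np.array([0.0, 0.0, 1.0]),
                             prime_dir=np.array([1.0, 0.0, 0.0]))
print(r, np.rad2deg(az), np.rad2deg(el))
```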
Elements with genus higher is one of these primitives or based on the primitive in than 1 also feature connected MA branches because a higher dimension, the MAT will not create an ob- of the bridge between the holes whose MA branch ject of its dimension minus 1, but it may create a MA connects the sides. This leaves an issue for objects with a dimensionality even lower. This violates our with genus 1 and no tail (Fig. 16.a) and objects with basic assumption that this is the case. This means convex elements (Fig. 16.b). The n-dimensional that the MA can not be used to determine the loca- MA is not confined by a (n-1)-dimensional object, tion of points of the shape uniquely. One approach which violates one of the basic assumptions of this to solve this problem is to utilise spherical coordi- method, namely the concept of cellular complexes. If nate systems. Fig. 15.a shows a 3D sphere that cre- an object produces a circular MA without connected a b for coordinate-systems. The 1D elements as edges and their end points as nodes, form a graph that rep- resents the object. Articulated, as well as non-rigid objects can be described by defining corresponding coordinate systems of each element. This should al- low complex composite transformations of the ob- ject. Intrinsic movement does not imply the transfor- mation of point-clouds or meshes, but of linked local Figure 16. a) 2D circular MA within a tube-like object with an arbitrarily set reference (white). b) 2D circular coordinate systems. MA branch as part of an object’s MA with an arbitrarily Further work to be done on the project is to pro- set reference (white). vide a proof of concept, especially concerning the feasibility of the method for n-dimensions and res- branches, there is no reference point that can be used olution of the open problems described in this paper. as the origin of the coordinate system. A first at- tempt to solve this problem, based on the findings of Acknowledgements Illetschko [8], is to place an arbitrary reference point. We would like to thank the reviewers for construc- This point can be used as the origin of the coordinate tive feedback and the PRIP Club, the organization system based on the MA. Depending on the dimen- of friends and promoters of Pattern Recognition and sionality of the object, also a cut can be necessary. Image Processing activities Vienna, Austria, for sup- Points within the area of the new origin can then be port. defined in relation to both end points of the MA. References A special case is shown in Fig. 17. The torus is a shape based on a sphere, meaning that it can be [1] J. Abiva and L. J. Larsson. Towards automated fil- described as a sphere moved along a circular path. As tering of the medial axis using the scale axis trans- explained earlier, this enforces the use of a spherical form. In Research in Shape Modeling, pages 115– coordinate system. Also the torus has a circular MA, 127. Springer, 2015. 4 which requires an arbitrarily set reference point. [2] O. Aichholzer, W. Aigner, F. Aurenhammer, and B. Juettler. Exact medial axis computation for trian- gulated solids with respect to piecewise linear met- rics. In J. Boissonnat, P. Chenin, A. Cohen, C. Gout, T. Lyche, M. Mazure, and L. Schumaker, editors, Curves and Surfaces, volume 6920 of Lecture Notes in Computer Science, pages 1–27. Springer Berlin Heidelberg, 2012. 2 [3] H. Blum. A Transformation for extracting new de- scriptors of shape. MIT Press, 1967. 4 Figure 17. Special case: Torus is a shape based on a sphere [4] G. 
Damiand and P. Lienhardt. Combinatorial Maps: and creates a circular MA. From arbitrarily set reference Efficient Data Structures for Computer Graphics point on the MA (white), a spherical coordinate system is and Image Processing, volume 129. A. K. Peters, swept along the MA. Ltd. Natick, MA, USA, 2014. 6 [5] P. Felzenszwalb, D. Mc Allester, and D. Ramanan. 4. Conclusion A discriminatively trained, multiscale, deformable part model. Computer Vision and Pattern Recogni- This paper proposes an novel concept for the tion, 2008. CVPR 2008. IEEE Conference on. IEEE, representation of n-dimensional shapes through a 2008. 2 model, based on linked local coordinate-systems. [6] Y. Gao, H. Chang, and Y. Demiris. User modelling Through recursive application of the MAT and for personalised dressing assistance by humanoid decomposition of the resulting MA, some n- robots. In Intelligent Robots and Systems (IROS), dimensional objects can be reduced to multiple 1- 2015 IEEE/RSJ International Conference on, pages dimensional sub-elements that are used as the axis 1840–1845, Sept 2015. 2 [7] M. Godec, P. Roth, and H. Bischof. Hough-based [20] Y. Zhu, F. Sun, Y. Choi, B. Juettler, and W. Wang. tracking of non-rigid objects, volume 117. Elsevier, Computing a compact spline representation of the 2011. 2 medial axis transform of a 2D shape, volume 76. El- sevier, 2014. 2 [8] T. Illetschko. Minimal combinatorial maps for an- alyzing 3D data. Diploma Thesis, TU Wien, 2006. 7 [9] V. A. Kovalevsky. Finite Topology as Applied to Im- age Analysis, volume 2. Academic Press, 1989. 3 [10] P. Li, B. Wang, F. Sun, X. Guo, C. Zhang, and W. Wang. Q-mat: Computing medial axis trans- form by quadratic error minimization. ACM Trans. Graph., 35(1):8:1–8:16, December 2015. 2 [11] Y. Li, C. Chen, and P. Allen. Recognition of de- formable object category and pose. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014. 2 [12] A. Montero and J. Lang. Skeleton pruning by con- tour approximation and the integer medial axis trans- form. Computers & Graphics, 36(5):477–487, 2012. 4 [13] A. Pouch, S. Tian, M. Takabe, H. Wang, J. Yuan, A. Cheung, B. Jackson, J. Gorman, R. Gorman, and P. Yushkevich. Segmentation of the aortic valve apparatus in 3d echocardiographic images: De- formable modeling of a branching medial structure. In Statistical Atlases and Computational Models of the Heart - Imaging and Modelling Challenges, vol- ume 8896 of Lecture Notes in Computer Science, pages 196–203. Springer International Publishing, 2015. 2 [14] A. Rosenfeld. Axial representations of shape, vol- ume 33. Academic Press Professional, 1986. 2 [15] L. Serino, C. Arcelli, and G. Sanniti di Baja. From skeleton branches to object parts, volume 129. El- sevier, 2014. 4 [16] E. Sherbrooke, N. Patrikalakis, and E. Brisson. An Algorithm for the Medial Axis Transform of 3D Poly- hedral Solids, volume 2. IEEE Educational Activi- ties Department Piscataway, 1996. 2 [17] M. Spitzner and R. Gonzalez. Shape peeling for improved image skeleton stability. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 1508–1512, 2015. 4 [18] H. Zhu, Y. Liu, J. Bai, and X. Ye. Construc- tive generation of the medial axis for solid models. Computer-Aided Design, 62:98 – 111, 2015. 2 [19] H. Zhu, Y. Liu, J. Zhao, and H. Wang. Calculat- ing the medial axis of a {CAD} model by multi-cpu based parallel computation. 
Advances in Engineer- ing Software, 85:96 – 107, 2015. 2 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 A Computer Vision System for Chess Game Tracking Can Koray Emre Sümer Department of Computer Engineering Department of Computer Engineering Bas¸kent University Bas¸kent University Ankara, TURKEY Ankara, TURKEY cannkorayy@gmail.com esumer@baskent.edu.tr Abstract. In this paper, we present a real-time sys- track by the system. tem that allows the detection of the moves of a chess On the other hand, in a study conducted by Bennet game. In the proposed approach, each captured and Lasenby [2], the recognition of chessboards un- video frame, from a RGB webcam positioned over the der deformation was carried out. Their method deter- chessboard, is processed through the following steps; mined a grid structure to detected vertices of a chess- the detection of the corner points of the chessboard board projection. Further, the same authors devel- grids, geometric rectification, chessboard position oped a feature detector named ‘Chess-board Extrac- adjustment, automatic camera exposure adjustment, tion by Subtraction and Summation (ChESS)’ to re- intensity adjustment, move detection and chessboard spond to chessboard vertices [3]. In a different study, drawing. All steps were implemented in MATLAB a chessboard recognition system was proposed [8]. programming environment without using any chess The proposed system was applied to chessboard in engine. The proposed approach correctly identified order to identify the name, location and the color of 162 of 164 moves in 3 games played under different the pieces. Piskorec et al. [10] presented a com- illumination conditions. puter vision system for chess game reconstruction. The system reconstructs a chessboard state based on video sequences obtained from two cameras. 1. Introduction The tracking of the chess moves can be regarded There are many systems of computer vision, as the preliminary task before designing a robotic which require algorithms to be able to recognize chess playing system. In the literature, there are different objects and scenes. Since, chess game several efforts that perform the chess move tracking. has become an interesting issue in terms of human- The studies conducted by Matuszek et al. [9], Urting computer interaction systems, a computer vision sys- and Berbers [12], Cour et al. [4] and Gonc¸vales et al. tem is needed for chess playing and chessboard [7] use unique algorithms to identify the chessboard recognition system. grids along with the classification of squares. These There are various published techniques related to methods are not only based on corner detection but chess-playing systems. Sokic and Ahic-Djokic [11] also rely on having a clean background. proposed a computer vision system for chess playing In this paper, we propose a real-time chess game robot manipulator as a project-based learning sys- tracking system using a RGB webcam positioned tem. The proposed algorithm detects chess moves by over the chessboard. In general, the move is detected comparing frames captured before, during and after by comparing the occupancy grids based on average a move, and finds the difference between them. In a color information of the pieces and the squares. Be- similar study, Atas¸ et al. 
[1] developed a chess play- fore that, several pre-processing steps are employed ing robotic arm system composed of various modules including geometric rectification, intensity adjust- such as main controller, image processing, machine ment and chessboard position adjustment. The sys- learning, game engine and motion engine of robot tem also works successfully under different illumina- arm. In their study, the top of the pieces are uniquely tion conditions by means of automatic camera expo- designed to be different from each other in order to sure adjustment. Besides been a tracking system, the proposed system can also perform 2D reconstruction of the chessboard states and generate movement logs. 2. Equipment and Setup In this work, a setup is prepared to detect the moves of the pieces during the game. The setup has the Logitech c310 webcam for the capturing footage. The camera which has 5 megapixels resolution is ca- pable of HD 720p recording. The camera has no aut- ofocus functionality. Only the exposure mode from the camera settings is changed to ‘manual mode’ for move detection process. The webcam was used on a mid-range notebook. The chessboard and pieces are selected to meet World Chess Federation (FIDE) re- quirements in terms of color and size [6]. The board and the pieces have different colors from each other. The colors of the pieces are black and white, while the board has dark and light brown colored squares. The camera is positioned over the chessboard by a long and flexible holder as shown in Figure 1. Figure 2. The overall framework Figure 1. The image of the setup located, the saturation value of the captured image is increased gradually as a pre-process step. Once all grid corners are located, the second step is to locate 3. The Overall Framework the chessboard corners (point-C in Figure 3(c)). The grid corner points which are closest to the corners The general block diagram of the proposed sys- of the image are selected as pivot points. Point-A in tem is given in Figure 2. The details of the steps of Figure 3(c) is one of the pivot points. The diagonal the proposed framework are given in the further sub- closest inner point to the point-A is point-B, which is sections. shown in Figure 3(c). The reflection of the point-B 3.1. Chessboard Grid Corner Detection over the point-A is the point-C, which is the one of the chessboard corners as shown in Figure 3(c). This In this process, the first step is to find all grid cor- procedure is applied for all remaining corner points. ner points of the chessboard (Figure 3(a)) by using the snapshot of the camera. To find grid corners (Fig- 3.2. Geometric Rectification ure 3(b)), we used detectCheckerboardPoints func- tion of MATLAB. The function that is particularly The geometric rectification is an essential step to used in camera calibration gets an RGB image as an isolate the chessboard from the environment and cor- input and returns the located grid corners and the size rect the perspective distortion of the chessboard to of the board as an output. Until all grid corners are pave the way for the other processes. The chessboard is warped from its corner points which are located in the previous section to coincide with our predeter- mined size square corners (480x480px) (Figure 4). (a) Figure 4. The chessboard before geometric rectification step This process is applied only once before the game starts therefore, either the camera or the board should not be moved during the game. 
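The rectification step can be sketched with standard OpenCV calls, as an assumed Python analogue of the MATLAB implementation: the four detected board corners are mapped onto a fixed 480x480 square by a homography that is then reused for every later snapshot. The corner ordering and the function name are illustrative.

```python
import cv2
import numpy as np

def rectify_board(frame, board_corners, size=480):
    """Sketch of the geometric rectification step: warp the four detected
    chessboard corners onto a fixed 480x480 square so that later per-square
    processing can work on an axis-aligned board."""
    src = np.asarray(board_corners, dtype=np.float32)        # TL, TR, BR, BL order assumed
    dst = np.float32([[0, 0], [size, 0], [size, size], [0, size]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame, H, (size, size)), H

# The same homography H can be reused for every later snapshot, since the
# rectification is computed once and camera and board must stay fixed.
```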
The geometrically corrected chessboard is presented in Figure 5. (b) Figure 5. The chessboard after geometric rectification step 3.3. Chessboard Position Adjustment (c) Figure 3. (a) The original chessboard image, (b) the de- To ease the calculations of the future processes, tected chessboard grid corners and (c) the related points the white pieces are needed to be positioned at the to chessboard grid corner detection bottom of the view. Thanks to camera position, we know that the positions of the pieces have to be on the left and right side of the camera. The comparison of the average colors of the both side’s king square gives us the position of the white pieces. According to the white side position, a new transformation matrix is computed to be used in the future warping processes. In Figure 6, the white pieces are located at the bottom while the black ones are at the top. Figure 6. The chessboard after chessboard position adjust- ment Figure 7. The pseudo-code of the automatic camera expo- sure adjustment 3.4. Automatic Camera Exposure Adjustment The built-in automatic exposure mode of the cam- era may cause undesirable image acquisition for the move detection. In this mode, the camera continu- ously adjusts the exposure level according to the cap- tured footage. Especially, whenever the player makes a move, the camera changes its exposure level due to the player hand on the captured image. In addition, the exposure level which is adjusted by built-in au- tomatic exposure mode of the camera can be under or overexposure. In order to find optimum exposure level of the camera, it needs to be adjusted manually at the beginning of the game. The aim of this process is to get correct color values as much as possible by preventing under and overexposure situations. We Figure 8. The chessboard after automatic camera exposure proposed our automatic camera exposure algorithm adjustment that aims to find the optimum exposure level which maximizes the average of the color differences be- 3.5. Intensity Adjustment tween light/dark piece and square (Figure 7). The calculated optimum exposure level is set to camera To improve the image quality, a set of enhance- as a new exposure level for the following processes. ments is applied to the snapshots of the camera. The In the present case the computed exposure level was first one is to reduce the noise problem. We used a computed to be -6 where the full range is between - 5x5 median filter to minimize the noise level of the 9 (the darkest) and 0 (the lightest). The snapshot of images. The second one is to increase the saturation the chessboard after applying the computed exposure of the image to enhance colors. After this process, level is given in Figure 8. the average colors of pieces and squares are calcu- lated to be used in further processes. The image of • Reference color of the light pieces is calculated the chessboard after the intensity adjustment step is by taking the average of the 16 squares that are illustrated in Figure 9. occupied by the light pieces. • Reference color of the dark pieces is calculated by taking the average of the 16 squares that are occupied by the dark pieces. • Reference color of the light squares is calculated by taking the average of the 16 light squares that are not occupied by any pieces. • Reference color of the dark squares is calculated by taking the average of the 16 dark squares that are not occupied by any pieces. 3.7. 
Move Detection The implementation of the move detection is based on a comparison between the reference im- Figure 9. The chessboard after intensity adjustment age and the snapshot of the camera. For this pro- cess, the reference image is used as the first snapshot 3.6. Average Color References which is taken after each valid move. The first ref- After all enhancements, in order to get color val- erence image is regarded as the first snapshot of the ues of each square of the chessboard, the image of footage. During the game, the average color differ- the chessboard (Figure 9) divided into 64 identical ence is calculated between the reference image and pieces each in correspondence to a square of the snapshots. Whenever the result of the calculation ex- board. Therefore occupancy grids are created for the ceeds a predetermined threshold, we conclude that chessboard. After that, it is defined a region of in- the player makes a move. After the result goes down terest (ROI) for each square (grid) of the chessboard. below the threshold, we assume that the player fin- The primary aim of using ROIs is to get color infor- ished the move. mation of the piece. ROI is defined as a 25x25px At this point, the last snapshot is interpreted to de- rectangle from the center of each square as shown in termine the color and position of the pieces. Before Figure 10. this process, the last snapshot is warped and the en- hancements are applied to the warped image of the chessboard. The ROI within each square of the im- age is compared with the four reference colors which are determined in section 3.6. In this comparison, the color differences are calculated in Lab color space by computing the deltaE value that represents the Eu- clidean distance of the related items. As a result of the comparisons, the reference color that gives the minimum deltaE value determines if a grid cell is a Figure 10. The orange colored region of interest superim- square or a piece with light or dark color. By ap- posed on the pawn plying this process to all squares of the chessboard, the chessboard state of the last snapshot is revealed. At the beginning of the game, before the move The state of the last snapshot and the previous chess- detection, the average color values of the light/dark board state are compared to detect the move of the pieces and squares are received and recorded as ref- piece. The previous chessboard state represents the erence values. chessboard state of the last valid move. At the begin- The reference colors of each type of piece and ning of the game, the first state of the game is stored square calculated as follows: as the previous chessboard state. When the state of the snapshot and the previ- ous chessboard state are compared, six different out- comes can be obtained: 1. If there is no difference between previous and last states, this means there is no change in the game. For this reason, the color difference over the board is not a move. 2. If there are only one occupied and only one un- occupied squares difference with the same piece color then this is a move. 3. If there are two occupied and two unoccupied squares difference with the same piece color, then this is a special move called ’castling’. 4. If there are one occupied and one unoccupied Figure 11. 
The reconstructed chessboard state with move squares difference with the same piece color and list one unoccupied square difference with the other piece color, then this is another special move The saturation enhancement which is applied to called ’en passant’. the images taken from capturing footage helped to in- crease the accuracy of the average color differences. 5. If there is only one unoccupied square differ- The combination of the lighting, camera settings ence and if there is a piece color change to the and chess set are playing a big role in the success previous piece color of the unoccupied cell in of detecting moves in a chess game. Although the any other occupied square, then this is a captur- proposed system works well under different illumi- ing move. nation conditions, lighting environments (having a single light source) that cast strong shadows over the 6. For all other conditions, the result of the com- board are unsuitable for tracking. parison is not a move. On the other hand, shadows over the light pieces If the result is a move then the state of the chess- are another important problem. This makes difficult board is updated as the last chessboard state. The to separate the light pieces from the light squares, as move is added to the move list and the last state of in 2 of 164 undetected moves during the experimen- the chessboard is reconstructed in 2D. An example tal evaluation. In addition, this problem may cause to 2D state reconstructed from a test game and the move get incorrect results from the automatic camera expo- list are presented in Figure 11. The moves are logged sure adjustment method. as standard algebraic notation which is the notation Shadows and specular reflections over a particu- standardized by World Chess Federation (FIDE) [5]. lar area of the chessboard can break the uniformity Note that all the steps of the proposed methodology of the colors. In these conditions, a chess game can- including the graphical user interface were imple- not be tracked by the proposed system. Besides, due mented in MATLAB. to the reference colors of pieces and squares are de- termined at the beginning of the game, the overall 4. Experimental Evaluation and Discussion illumination of the environment should not change dramatically during the game. Otherwise, the move In order to test the system, three chess games are detection cannot be possible by the system. played at different times having different illumina- tion conditions. In these tests, 162 moves of all 164 5. Conclusion moves are successfully detected by the system. The corner points of the chessboard are successfully lo- In this paper, we have presented a real-time sys- cated in all games. The system performance was tem that performs the detection of the chess moves. found to be satisfactory to detect moves in real-time. The preprocessing steps are found to be quite useful. In particular, automatic camera exposure adjustment tion. Proceedings of the 34th International Conven- highly reduces the color ambiguities. The environ- tion, pages 870–876, 2011. 1 ments which are heavily under the influence of direc- [11] E. Sokic and M. Ahic-Dokic. Simple computer vi- tional lights are not recommended because of casting sion system for chess playing robot manipulator as strong shadows. The results of the played games in- a project-based learning example. 
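The colour comparison behind the move interpretation can be sketched as follows, assuming scikit-image for the Lab conversion and the CIE76 deltaE: the mean colour of a square's ROI is matched against the four reference colours and the closest one labels the square. The reference RGB values below are made up; in the system they are measured from the first frame as described in Section 3.6.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_cie76

def classify_square(roi_rgb, references):
    """Compare the mean colour of a square's ROI against the four reference
    colours (light/dark piece, light/dark empty square) using the Euclidean
    distance in Lab (deltaE) and return the label of the closest one."""
    roi_lab = rgb2lab(roi_rgb.astype(np.float64) / 255.0)
    mean_lab = roi_lab.reshape(-1, 3).mean(axis=0)

    best_label, best_de = None, np.inf
    for label, ref_rgb in references.items():
        ref_lab = rgb2lab(np.asarray(ref_rgb, dtype=np.float64).reshape(1, 1, 3) / 255.0)[0, 0]
        de = deltaE_cie76(mean_lab, ref_lab)
        if de < best_de:
            best_label, best_de = label, de
    return best_label

# illustrative reference colours (RGB); in practice measured from the first frame
references = {
    "light_piece": (230, 225, 210), "dark_piece": (40, 35, 30),
    "light_square": (200, 170, 130), "dark_square": (120, 80, 50),
}
roi = np.full((25, 25, 3), 228, dtype=np.uint8)      # a 25x25 px ROI over a light piece
print(classify_square(roi, references))              # -> "light_piece"
```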
Proceedings of the IEEE International Symposium on Signal Pro- dicate that the proposed system can be an affordable cessing and Information Technology, pages 75–79, and efficient option among chess game tracking sys- 2008. 1 tems. [12] D. Urting and Y. Berbers. Marineblue: A low- As an addition to the current system, a chess move cost chess robot. Proceedings of the International validation system is under progress to interpret the Conference Robotics and Applications, pages 76– player moves. By this way, the system not only 81, 2003. 1 tracks the position of the pieces but also validates the movements according to the type of the piece. Therefore, the future system can be used to help de- cision making and monitoring by referees and anti- cheat committee. References [1] M. Atas¸, Y. Do˘gan, and ˙I. Atas¸. Chess playing robotic arm. Proceedings of the IEEE 22nd Signal Processing and Communications Applications Con- ference, pages 1171–1174, 2014. 1 [2] S. Bennet and J. Lasenby. Robust recognition of chess-boards under deformation. Proceedings of the 20th IEEE International Conference on Image Pro- cessing, pages 2650–2654, 2013. 1 [3] S. Bennet and J. Lasenby. Chess – quick and robust detection of chess-board features. Computer Vision and Image Understanding, 118:197–210, 2014. 1 [4] T. Cour, R. Lauranson, and M. Vachette. Au- tonomous chess-playing robot, 2006. 1 [5] FIDE. Handbook, 2015. Laws Of Chess. 6 [6] FIDE. Handbook, 2015. Standards of Chess Equip- ment and Tournament Venue. 2 [7] J. Gonc¸alves, J. Lima, and P. Leitao. Chess robot system: A multi-disciplinary experience in automa- tion. Proceedings of the 9th Spanish-Portuguese Congress on Electrical Engineering, 2005. 1 [8] I. M. Khater, A. S. Ghorab, and I. A. Aljar- rah. Chessboard recognition system using signature, principle component analysis and color information. Proceedings of the Second International Conference on Digital Information Processing and Communica- tions, pages 141–145, 2012. 1 [9] C. Matuszek, B. Mayton, R. Aimi, M. P. Deisenroth, L. Bo, R. Chu, M. Kung, L. LeGrand, J. R. Smith, and D. Fox. Gambit: A robust chess-playing robotic system. Proceedings of the IEEE International Con- ference on Robotics and Automation, pages 4291– 4297, 2011. 1 [10] M. Piskorec, N. Antulov-Fantulin, J. Curic, O. Dragoljevic, V. Ivanac, and L. Karlovic. Com- puter vision system for the chess game reconstruc- 21st Computer Vision Winter Workshop Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.) Rimske Toplice, Slovenia, February 3–5, 2016 Fast L1-based RANSAC for homography estimation Jonáš Šer´ych, Jiř´ı Matas, Ondřej Drbohlav Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Center for Machine Perception, Technická 2, 166 27 Praha 6, Czech Republic {serycjon,matas,drbohlav}@fel.cvut.cz Abstract. We revisit the problem of local optimiza- sistent with all correct correspondences. tion (LO) in RANSAC for homography estimation. The problem was first addressed in a paper by The standard state-of-the-art LO-RANSAC improves Chum et al [2] who proposed an additional RANSAC the plain version’s accuracy and stability, but it may step, the so called local optimization (LO). The LO be computationally demanding, it is complex to im- step is employed whenever a new candidate model plement and requires setting multiple parameters. M is the best one so far found in the RANSAC loop, We show that employing L1 minimization instead of i.e. 
the standard LO step of LO-RANSAC leads to results with similar precision. At the same time, the proposed L1 minimization is significantly faster than the standard LO step of [8], it is easy to implement, and it has only a few parameters, all of which have an intuitive interpretation. On the negative side, the L1 minimization does not achieve the robustness of the standard LO step; its probability of failure is higher.

1. Introduction

RANSAC [3] is a robust model fitting algorithm that is the standard method used for two-view geometry estimation [5]. The plain version of RANSAC proceeds as follows: (i) randomly sample the minimum number of points required to calculate the model parameters, (ii) compute the cardinality of the set consistent with that model, i.e. the number of inliers, and (iii) terminate if the probability that a better model than the best one found so far will be found falls under a predefined threshold. The precision of the model returned by the algorithm is typically improved by least square fitting of the inliers of the best model.

It has been observed [11] that the termination criterion (iii) stops the process later than expected given the recovered percentage of inliers. The discrepancy is due to a generally incorrect (overoptimistic) assumption that every minimal sample of inliers generates a "good" model, i.e. a model that will be consistent with all correct correspondences.

The LO step is invoked whenever a new candidate model M is the best one found so far in the RANSAC loop, i.e. it has more inliers than any of the models estimated from the random minimal samples evaluated so far. Chum et al. [2] proved that with this strategy, the LO step is run only log(k) times, where k is the number of random models tested.

The local optimization step [2] performs various heuristic procedures with the objective of increasing the accuracy of M, such as generating hypotheses by resampling the inliers of M and performing iterative least square estimation combined with scheduled inlier threshold changes. The standard implementation of RANSAC with the local optimization step, found in the commonly used publicly available code [10], is a combination of the above-mentioned heuristic procedures.

The choice and parameter settings of local optimization methods influence the speed and accuracy of the algorithm. In the state-of-the-art version [8], the LO step executes a complex procedure which involves repeated sampling from the inliers of M and repeated iterative least squares minimisation. As sampling is involved, it is stochastic¹. Due to both the repeated sampling and the iterative least squares, it is so computationally demanding, in comparison with RANSAC steps (i) and (ii), that despite being executed only rarely, the LO step significantly influences the overall running time.

In this paper, we propose to replace the complex LO procedure of Lebeda et al. by minimization of the sum of L1 norms of the residuals, i.e. the algebraic errors of the model on the individual points. The minimizer of the L1 norm, also known as the geometric median, is robust to a modest contamination by outliers. This means that RANSAC becomes less sensitive to the inlier-outlier threshold. The threshold, a critical parameter of RANSAC, can therefore be set more loosely and thus cover a wide range of problems. Moreover, the L1-based procedure need not include least square estimation with multiple thresholds, thus saving time.

In practice we replace the L1 norm by the response of the Huber robust kernel to the inlier algebraic errors. The Huber cost function is defined in Eq. 6. The Huber kernel is differentiable and convex, and the global minimum of the cost function can be found by gradient descent. The gradient minimization alternates with the inlier-outlier selection process. The alternating minimisation can be seen as local optimization of the truncated Huber kernel. The procedure has only a small number of parameters with intuitive meaning, it is simple, and it is deterministic.

We show that the minimization produces errors which are comparable to the standard LO-RANSAC, while being computationally much less expensive, by an order of magnitude in our experiments compared to the standard local optimization.

¹Since the outer loop of RANSAC is stochastic, the inner sampling does not change the character of the algorithm.

1: procedure STANDARD LO
2:   Input: M (model estimated by LSq), I (inliers)
3:   for r = 1 → reps do
4:     sample S drawn from inliers
5:     model M is estimated from S
6:     iterative least squares on M
7:   end for
8:   return best model
9: end procedure

1: procedure L1-BASED LO
2:   Input: M (minimum sample model), I (inliers)
3:   while stopping condition not met do
4:     M ← model estimated from inliers by IRLS optimization
5:     I ← inliers to M
6:   end while
7:   return M
8: end procedure

Table 1: Comparison of the standard and the proposed local optimization procedures in RANSAC. IRLS stands for iterated re-weighted least squares. Note that the standard LO method includes several rounds of IRLS, which are themselves computationally demanding (for details, see [8]). The iteration stops if either the change in the cost function is below 10⁻³ or the maximum number of iterations is reached (set to 5).
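As a rough sketch of where the LO step sits in the outer loop, the following Python skeleton runs plain RANSAC for a homography and calls a pluggable local optimization only when a hypothesis becomes the best so far. The callables `fit_minimal`, `residuals` and `local_opt` are placeholders to be supplied (for example a 4-point DLT, a reprojection error, and either LO variant of Table 1), and the adaptive termination criterion (iii) is simplified here to a fixed iteration cap.

```python
import numpy as np

def ransac_with_lo(pts1, pts2, fit_minimal, residuals, local_opt,
                   thresh=3.0, max_iter=2000, rng=None):
    """Skeleton of LO-style RANSAC: the (possibly expensive) local optimisation
    is only invoked when a minimal-sample hypothesis becomes the best so far,
    which happens roughly log(k) times over k tested models."""
    rng = rng or np.random.default_rng()
    pts1, pts2 = np.asarray(pts1), np.asarray(pts2)
    n = len(pts1)
    best_model, best_inliers = None, np.zeros(n, dtype=bool)

    for _ in range(max_iter):
        sample = rng.choice(n, size=4, replace=False)       # 4 points for a homography
        model = fit_minimal(pts1[sample], pts2[sample])
        if model is None:
            continue
        inliers = residuals(model, pts1, pts2) < thresh
        if inliers.sum() > best_inliers.sum():
            # new best-so-far hypothesis: run the LO step (standard or L1/IRLS based);
            # local_opt is assumed to return a refined model and its inlier mask
            model, inliers = local_opt(model, pts1, pts2, inliers, thresh)
            if inliers.sum() > best_inliers.sum():
                best_model, best_inliers = model, inliers
    return best_model, best_inliers
```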
2. Method

The difference between the standard and the proposed LO method is presented in Table 1. The L1 minimization is carried out by iterated reweighted least squares (IRLS). The particular instantiation of IRLS is known as the generalized Weiszfeld algorithm [1]. Weiszfeld proved that geometric median minimization by IRLS requires solving repeated least-squares problems in which each data point is weighted by the reciprocal of its residual with respect to the current estimate of the model. The algorithm has to be modified to avoid singularities when some point is exactly consistent with the model, i.e. when it has a zero residual. To avoid the problem, we replace the L1 minimization with Huber kernel minimization. In the implementation, this only means that points with small residuals are not scaled.

First, the necessary notation is introduced. The $L_2^2$ norm (for a vector $r \in \mathbb{R}^D$) is defined as

$$\|r\|_2^2 = \sum_{j=1}^{D} |r_j|^2, \qquad (1)$$

and the $L_2^1$ norm (for $r \in \mathbb{R}^D$) as

$$\|r\|_2^1 = \sqrt{\sum_{j=1}^{D} |r_j|^2}. \qquad (2)$$

2.1. Homography estimation by algebraic error minimization in the $L_2^2$ and $L_2^1$ norms

Let the number of correspondences be M. The data matrix Z is computed from the correspondences by a standard procedure [5]: let $(x, y)$ and $(x', y')$ be a correspondence pair. It generates two rows of the data matrix Z:

$$\begin{pmatrix} x & y & 1 & 0 & 0 & 0 & -x'x & -x'y & -x' \\ 0 & 0 & 0 & x & y & 1 & -y'x & -y'y & -y' \end{pmatrix}. \qquad (3)$$

Let $z^{(i)}$ denote the two rows generated by the i-th correspondence.
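As an illustration, the data matrix Z of Eq. (3) can be assembled as follows; this is a sketch assuming the correspondences are given as rows (x, y, x', y').

```python
import numpy as np

def data_matrix(corrs):
    """Assemble the 2M x 9 data matrix Z of Eq. (3); every correspondence
    (x, y) <-> (x', y') contributes two consecutive rows."""
    Z = []
    for x, y, xp, yp in corrs:
        Z.append([x, y, 1.0, 0.0, 0.0, 0.0, -xp * x, -xp * y, -xp])
        Z.append([0.0, 0.0, 0.0, x, y, 1.0, -yp * x, -yp * y, -yp])
    return np.asarray(Z)
```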
The homography h is estimated from Z by one of the following optimizations.

The $L_2^2$ optimization

$$\hat{h} = \arg\min_{\hat{h}} \sum_{i=1}^{M} \|z^{(i)} \hat{h}\|_2^2 \quad \text{subj. to } \hat{h}^{\top}\hat{h} = 1 \qquad (4)$$

is solved by computing the spectral decomposition of $Z^{\top}Z$ and taking the eigenvector corresponding to the smallest eigenvalue. The algorithm has the following properties: it is fast, but it is not robust, with a breakdown point of zero [7] – in general, a single outlier can make h arbitrarily wrong².

² In RANSAC, the error on a single point is bounded by the inlier threshold. In practice, points close to the inlier-outlier boundary make the outcome of standard RANSAC unstable.

The $L_2^1$ optimization, defined as

$$\hat{h} = \arg\min_{\hat{h}} \sum_{i=1}^{M} \|z^{(i)} \hat{h}\|_2^1 \quad \text{subj. to } \hat{h}^{\top}\hat{h} = 1, \qquad (5)$$

is robust and can be solved by the generalized Weiszfeld algorithm, an instance of IRLS. Instead of modifying the algorithm to take care of the technical problems associated with the Weiszfeld algorithm, which arise if one of the residuals vanishes, we optimize the response of the Huber kernel instead.

The Huber optimization is defined as

$$\hat{h} = \arg\min_{\hat{h}} \sum_{i=1}^{M} \begin{cases} \tfrac{1}{2}\,\|r^{(i)}\|_2^2 & : \|r^{(i)}\|_2^1 \le k \\ k\,\big(\|r^{(i)}\|_2^1 - \tfrac{k}{2}\big) & : \|r^{(i)}\|_2^1 \ge k \end{cases} \quad \text{subj. to } \hat{h}^{\top}\hat{h} = 1 \text{ and } r^{(i)} = z^{(i)}\hat{h}. \qquad (6)$$

The motivation for using this optimization is its robustness. It is closely related to geometric median computation and the formulation is convex. It is a well-known property of the median that it is robust to outliers for up to 50% contamination of the samples. This property makes the procedure insensitive to the choice of the inlier-outlier threshold of the "outer" RANSAC loop.

The minimization is carried out by a slightly modified Weiszfeld algorithm [12], an iteratively reweighted least-squares method:

1: procedure IRLS OPTIMIZATION
2:   Initialize h as the estimate obtained from the minimal sample
3:   while stopping condition not met do
4:     Compute the error $r^{(i)} = \|z^{(i)} h\|_2^1$ for all $i = 1, 2, \ldots, M$   (7)
5:     Reweight Z: $z^{(i)} \leftarrow \sqrt{w^{(i)}}\, z^{(i)}$ for all $i = 1, 2, \ldots, M$   (8)
6:     Recompute h using the $L_2^2$ optimization (4)
7:   end while
8: end procedure

The iteration stops when

$$\sum_{i} r^{(i)}_{(t)} - \sum_{i} r^{(i)}_{(t+1)} \approx 0,$$

i.e. when the value of the cost function does not change between consecutive iterations, or after 5 iterations have been completed. The second condition reflects the empirical observation that most of the time the IRLS algorithm converges within 3 iterations; it is used only as a guarantee against an infinite loop.

In the case of the $L_2^1$ optimization, the weight $w^{(i)}$ is set to $1 / (\|r^{(i)}\|_2^1 + \delta)$. A small constant δ is used to avoid the problem of dividing by zero when a residual vanishes. The $L_2^1$ optimization proposed above thus introduces an additional parameter δ in order to deal with the division by zero, but its interpretation is not clear. Using the Huber cost function instead of the L1 norm avoids this numerical issue. The weight $w^{(i)}$ is then set as follows [13]:

$$w^{(i)} = \begin{cases} 1 & : \|r^{(i)}\|_2^1 \le k \\ k \,/\, \|r^{(i)}\|_2^1 & : \|r^{(i)}\|_2^1 \ge k \end{cases}$$

The additional Huber parameter k can be intuitively seen as a smoothing factor between the $L_2^2$ and $L_2^1$ norms or, alternatively, as a lower bound on the inlier threshold.
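One possible rendering of the IRLS optimization above is sketched below; it reuses the data_matrix helper from Section 2.1, applies the Huber weights just defined, and solves each reweighted subproblem via Eq. (4). The stopping rule follows the text; the value of the Huber parameter k passed in is only an illustrative assumption.

```python
import numpy as np

def l2_fit(Z):
    """Eq. (4): the unit vector h minimising ||Z h||_2^2 is the eigenvector of
    Z^T Z corresponding to the smallest eigenvalue."""
    _, vecs = np.linalg.eigh(Z.T @ Z)        # eigenvalues returned in ascending order
    return vecs[:, 0]

def irls_fit(corrs, h_init, k=1.0, max_iters=5, tol=1e-3):
    """Huber-weighted IRLS (generalised Weiszfeld) for the homography 9-vector h."""
    Z = data_matrix(corrs)
    h = np.asarray(h_init, dtype=float).ravel()
    h = h / np.linalg.norm(h)
    prev_cost = np.inf
    for _ in range(max_iters):
        # per-correspondence residual magnitudes ||r^(i)||_2^1, Eq. (7)
        r = np.linalg.norm((Z @ h).reshape(-1, 2), axis=1)
        cost = r.sum()
        if abs(prev_cost - cost) < tol:      # cost change below 10^-3
            break
        prev_cost = cost
        # Huber weights, applied to both rows of each z^(i), Eq. (8)
        w = np.where(r <= k, 1.0, k / np.maximum(r, 1e-12))
        Zw = np.repeat(np.sqrt(w), 2)[:, None] * Z
        h = l2_fit(Zw)                       # reweighted L2^2 step, Eq. (4)
    return h
```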
3. Experiments

We compared the standard RANSAC, the state-of-the-art LO-RANSAC and the proposed L1-based RANSAC on a dataset consisting of 42 image pairs, including selected images from the ZuBuD dataset [4], images from Lebeda's homog dataset [9] used for the evaluation of LO-RANSAC, and images from the symbench dataset [6]. The Hessian-Affine feature detector with the SIFT descriptor was used for obtaining the tentative correspondences.

Image | Qty | RANSAC | LO-RANSAC | L1-based RANSAC
05 | I | 953.2 ±0.9 (950-956) | 953.0 ±0.0 (953-953) | 953.0 ±0.1 (952-953)
05 | LO time (µs) | 0.0 ±0.0 (0-0) | 29158.8 ±3383.8 (27499-42497) | 3934.6 ±1035.1 (1901-6479)
05 | I (%) | 76.9 ±0.1 (77-77) | 76.9 ±0.0 (77-77) | 76.9 ±0.0 (77-77)
05 | Samp | 11.8 ±5.7 (7-35) | 11.8 ±5.7 (7-35) | 7.5 ±1.9 (7-19)
05 | Time (ms) | 6.1 | 35.5 | 13.7
05 | Error | 0.74 ±0.05 (0.6-0.9) | 0.72 ±0.00 (0.7-0.7) | 0.73 ±0.01 (0.7-0.8)
05 | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 2.2 ±0.9 (1-5)
adam | I | 250.8 ±1.1 (244-252) | 251.0 ±0.0 (251-251) | 251.0 ±0.0 (251-251)
adam | LO time (µs) | 0.0 ±0.0 (0-0) | 10922.9 ±1797.6 (8737-15292) | 1318.5 ±214.1 (917-1917)
adam | I (%) | 97.6 ±0.4 (95-98) | 97.7 ±0.0 (98-98) | 97.7 ±0.0 (98-98)
adam | Samp | 5.0 ±2.6 (2-14) | 5.0 ±2.6 (2-14) | 2.0 ±0.3 (2-4)
adam | Time (ms) | 2.3 | 14.4 | 4.4
adam | Error | 1.15 ±0.45 (0.4-2.8) | 0.77 ±0.05 (0.6-0.8) | 0.79 ±0.02 (0.6-0.8)
adam | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.4 ±0.5 (1-2)
boat | I | 328.4 ±0.5 (328-330) | 328.0 ±0.2 (328-329) | 328.0 ±0.0 (328-328)
boat | LO time (µs) | 0.0 ±0.0 (0-0) | 13874.9 ±2006.0 (11071-16489) | 1738.3 ±323.7 (917-2428)
boat | I (%) | 86.2 ±0.1 (86-87) | 86.1 ±0.1 (86-86) | 86.1 ±0.0 (86-86)
boat | Samp | 6.2 ±2.5 (4-15) | 6.2 ±2.5 (4-15) | 4.1 ±0.4 (4-7)
boat | Time (ms) | 2.6 | 17.8 | 5.9
boat | Error | 1.30 ±0.14 (1.1-2.1) | 1.23 ±0.01 (1.2-1.2) | 1.24 ±0.00 (1.2-1.2)
boat | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.7 ±0.7 (1-3)
Brussels | I | 450.0 ±3.5 (428-451) | 451.0 ±0.0 (451-451) | 451.0 ±0.0 (451-451)
Brussels | LO time (µs) | 0.0 ±0.0 (0-0) | 16342.6 ±2310.4 (13755-19648) | 2094.9 ±347.1 (1090-3084)
Brussels | I (%) | 87.2 ±0.7 (83-87) | 87.4 ±0.0 (87-87) | 87.4 ±0.0 (87-87)
Brussels | Samp | 8.3 ±4.2 (4-22) | 8.3 ±4.2 (4-22) | 4.1 ±0.3 (4-6)
Brussels | Time (ms) | 3.4 | 20.6 | 6.6
Brussels | Error | 1.39 ±0.37 (1.1-3.3) | 1.24 ±0.00 (1.2-1.2) | 1.24 ±0.00 (1.2-1.3)
Brussels | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.8 ±0.7 (1-3)
graf | I | 840.1 ±9.8 (808-848) | 846.2 ±0.4 (846-847) | 846.0 ±0.0 (846-846)
graf | LO time (µs) | 0.0 ±0.0 (0-0) | 24032.7 ±2219.6 (21845-29919) | 4274.4 ±834.5 (1794-6007)
graf | I (%) | 89.9 ±1.1 (87-91) | 90.6 ±0.0 (91-91) | 90.6 ±0.0 (91-91)
graf | Samp | 7.3 ±3.5 (3-20) | 7.3 ±3.5 (3-20) | 3.2 ±0.7 (3-8)
graf | Time (ms) | 4.8 | 29.5 | 11.9
graf | Error | 1.69 ±0.22 (1.4-2.7) | 1.45 ±0.00 (1.4-1.5) | 1.45 ±0.01 (1.4-1.6)
graf | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 1.7 ±0.7 (1-4)
sym notredame13 | I | 89.6 ±2.4 (77-93) | 91.0 ±0.2 (91-92) | 91.0 ±0.2 (90-92)
sym notredame13 | LO time (µs) | 0.0 ±0.0 (0-0) | 8090.3 ±1025.8 (7196-10973) | 707.8 ±131.0 (437-1009)
sym notredame13 | I (%) | 48.4 ±1.3 (42-50) | 49.2 ±0.1 (49-50) | 49.2 ±0.1 (49-50)
sym notredame13 | Samp | 110.6 ±53.7 (45-257) | 54.0 ±4.0 (45-67) | 46.1 ±14.8 (37-123)
sym notredame13 | Time (ms) | 4.2 | 11.7 | 5.9
sym notredame13 | Error | 1.81 ±0.62 (1.1-4.7) | 1.13 ±0.02 (1.1-1.2) | 1.15 ±0.09 (1.1-1.7)
sym notredame13 | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 2.9 ±1.2 (1-7)

Table 2: Results on six pairs that represent the whole dataset well, with the exception of the cases in Tab. 4. The entries are the number of inliers found (I), the inlier ratio I (%), the LO step time (LO time), the number of RANSAC samples (Samp), the CPU time (Time), the mean error on ground-truth correspondences (Error), and the number of local optimizations (LO count). The first value in each cell is the mean over 100 runs, the ± entries are standard deviations, and the minimum and maximum are given in parentheses. The blue plots represent the stability of each algorithm over 100 runs. The left one shows the probability of a tentative correspondence being an inlier (probability on the vertical axis, correspondence index on the horizontal axis); the correspondences were sorted so that the plot is non-increasing. In the ideal case, the plot looks like a rectangle; any other shape indicates that some of the tentative correspondences were not classified as inliers/outliers consistently over the 100 runs. The second plot is a histogram of the first plot.
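The stability plots described in the caption can be computed from the per-run inlier classifications in a few lines. This is an illustrative sketch; the histogram binning is an assumption.

```python
import numpy as np

def stability_curves(inlier_masks, bins=10):
    """inlier_masks: boolean array of shape (runs, correspondences), one row per
    RANSAC run. Returns the per-correspondence inlier probability sorted to be
    non-increasing (ideally a rectangle) and its histogram."""
    prob = np.asarray(inlier_masks, dtype=float).mean(axis=0)
    curve = np.sort(prob)[::-1]
    hist, edges = np.histogram(prob, bins=bins, range=(0.0, 1.0))
    return curve, hist, edges
```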
[Table 3 panels: for each of the image pairs 05, adam, boat, Brussels, graf and sym_notredame15, the first image, the second image, and a threshold-sensitivity plot of the mean GT error [px] against the inlier threshold [px] (range 5-20) are shown.]

Table 3: The dependence of the ground-truth error on the inlier threshold (RANSAC green, LO-RANSAC blue, L1-based RANSAC red). Note that the proposed L1 algorithm yields results very similar to LO-RANSAC. The ground-truth error was averaged over 10 runs for each of the methods. Experimental results are shown on the same image pairs as in Table 2.

The RANSAC parameters common to all three tested versions used in our experiments are summarized in Table 6. The inlier threshold θ is set, following [8], from σ in the following way:

$$\theta = 5.99\,(\sigma S)^2,$$

where $S = \max(w, h)/768$ is a scale factor dependent on the image dimensions. The factor 5.99 is the 95th percentile of the $\chi^2$ distribution with two degrees of freedom. Additional parameters used for the standard LO-RANSAC are summarized in Table 7. The proposed method does not introduce any new parameters.

confidence | 0.95
σ | 2.0
sample limit | 500000

Table 6: RANSAC parameters.

ILSQ iterations | 4
ILSQ sample limit | 28
threshold multiplier | 4
inner RANSAC repetitions | 10

Table 7: LO-RANSAC parameters.

Table 2 shows a sample of six image pairs representing well the results on the whole dataset, with the exception of a few cases described later. Note that the proposed L1 optimization is usually about 5 times faster than the standard LO step (see 'LO time'). Table 4 summarizes the performance on the few exceptional cases.

The error (see 'Error' in the tables) was computed by reprojecting hand-made ground-truth correspondences (about 8 of them for each image pair) by the model found by the respective algorithm.

Two observations summarize the results: i) the proposed procedure yields an error comparable to that of the standard LO-RANSAC, and ii) it usually runs approximately 5 times faster (see 'LO time' in the tables).

Table 3 shows the comparison of the dependence of the error on the inlier threshold for the standard RANSAC, the standard LO-RANSAC and the proposed method. The results are shown on the same subset of six image pairs, which is representative of the whole dataset. The experiment confirms that the proposed procedure is able to achieve results similar to the standard LO-RANSAC.

Image | Qty | RANSAC | LO-RANSAC | L1-based RANSAC
BruggeSquare | I | 201.0 ±12.9 (172-234) | 227.2 ±1.3 (224-232) | 214.9 ±11.4 (195-228)
BruggeSquare | LO time (µs) | 0.0 ±0.0 (0-0) | 12567.0 ±1944.9 (10166-16251) | 993.8 ±187.2 (598-1650)
BruggeSquare | I (%) | 60.0 ±3.9 (51-70) | 67.8 ±0.4 (67-69) | 64.1 ±3.4 (58-68)
BruggeSquare | Samp | 52.6 ±24.8 (15-153) | 42.5 ±10.9 (15-59) | 17.7 ±5.7 (9-41)
BruggeSquare | Time (ms) | 5.2 | 18.0 | 5.8
BruggeSquare | Error | 3.50 ±1.25 (1.2-6.2) | 2.44 ±0.12 (2.0-2.7) | 2.93 ±0.91 (1.3-4.6)
BruggeSquare | LO count | 0.0 ±0.0 (0-0) | 1.0 ±0.0 (1-1) | 2.7 ±1.2 (1-5)
dlazky | I | 11.0 ±0.2 (9-11) | 11.0 ±0.0 (11-11) | 10.9 ±0.6 (7-11)
dlazky | LO time (µs) | 0.0 ±0.0 (0-0) | 1531.3 ±746.7 (603-4434) | 249.7 ±68.9 (109-419)
dlazky | I (%) | 14.8 ±0.3 (12-15) | 14.9 ±0.0 (15-15) | 14.7 ±0.8 (9-15)
dlazky | Samp | 10745.0 ±5429.3 (6963-27356) | 8392.7 ±3432.8 (6963-25008) | 8220.6 ±3301.0 (4820-19947)
dlazky | Time (ms) | 87.3 | 76.0 | 71.6
dlazky | Error | 2.99 ±0.95 (2.6-6.4) | 2.61 ±0.00 (2.6-2.6) | 5.43 ±20.63 (2.6-204.9)
dlazky | LO count | 0.0 ±0.0 (0-0) | 4.7 ±1.6 (1-9) | 7.0 ±3.1 (2-21)

Table 4: Results on two image pairs with unusual sensitivity to the inlier threshold. See the caption of Tab. 2 for a description of the entries.

The results for the two exceptional image pairs are shown in Table 5. The standard LO-RANSAC achieves good results (high stability, low error), while our proposed algorithm fails to stabilize the plain RANSAC results (the 'dlazky' pair is one of the most challenging ones in our dataset, as there are only 11 inliers).

[Table 5 panels: threshold-sensitivity plots of the mean GT error [px] against the inlier threshold [px] (range 5-20) for the BruggeSquare and dlazky pairs.]

Table 5: The dependence of the ground-truth error on the inlier threshold (RANSAC green, LO-RANSAC blue, L1-based RANSAC red) for two failure cases.
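For completeness, the two quantities used throughout this section, the inlier threshold θ = 5.99 (σS)² and the mean ground-truth error, can be sketched as follows. The reprojection step is one plausible reading of "reprojecting hand-made ground truth correspondences by the model found"; function names are illustrative.

```python
import numpy as np

def inlier_threshold(width, height, sigma=2.0):
    """theta = 5.99 * (sigma * S)^2 with S = max(w, h) / 768; 5.99 is the 95th
    percentile of the chi-squared distribution with two degrees of freedom."""
    S = max(width, height) / 768.0
    return 5.99 * (sigma * S) ** 2

def mean_gt_error(H, gt_src, gt_dst):
    """Mean distance between the hand-made ground-truth points in the second
    image (gt_dst, N x 2) and the corresponding points of the first image
    (gt_src, N x 2) mapped by the estimated 3x3 homography H."""
    pts = np.hstack([gt_src, np.ones((len(gt_src), 1))])   # homogeneous coordinates
    proj = pts @ H.T
    proj = proj[:, :2] / proj[:, 2:3]
    return float(np.linalg.norm(proj - gt_dst, axis=1).mean())
```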
4. Conclusions

We have shown that replacing the standard LO step of LO-RANSAC with the minimization of the sum of Huber kernel responses to the residuals has the following properties: it is simple and deterministic, it produces errors similar to those of the standard LO-RANSAC, and it is usually approximately 5 times faster. On the negative side, in the current implementation it has a higher probability of failure than the standard LO-RANSAC.

Acknowledgements

The research was supported by CTU student grant SGS15/155/OHK3/2T/13.

References

[1] A. Beck and S. Sabach. Weiszfeld's method: Old and new results. Journal of Optimization Theory and Applications, 164(1):1–40, 2014.
[2] O. Chum, J. Matas, and J. Kittler. Locally optimized RANSAC. In Pattern Recognition, pages 236–243. Springer, 2003.
[3] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
[4] H. Shao, T. Svoboda, and L. Van Gool. ZuBuD – Zürich buildings database for image based recognition. Technical report, 2003.
[5] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK, 2000. Chapter 3: Estimation – 2D Projective Transformations.
[6] D. C. Hauagge and N. Snavely. Image matching using local symmetry features. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 206–213. IEEE, 2012.
[7] P. Huber. Robust Statistics. Wiley Series in Probability and Statistics – Applied Probability and Statistics Section Series. Wiley, 2004.
[8] K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized RANSAC. In R. Bowden, J. Collomosse, and K. Mikolajczyk, editors, Proceedings of the British Machine Vision Conference, pages 1013–1023, London, UK, September 2012. BMVA.
[9] K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized RANSAC – Full experimental evaluation. Research Report CTU–CMP–2012–17, Center for Machine Perception, Czech Technical University, Prague, Czech Republic, September 2012.
[10] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J. Frahm. USAC: A universal framework for random sample consensus. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):2022–2038, Aug 2013.
[11] B. Tordoff and D. W. Murray. Guided sampling and consensus for motion estimation. In Computer Vision – ECCV 2002, pages 82–96. Springer, 2002.
[12] E. Weiszfeld and F. Plastria. On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research, 167(1):7–41, 2009.
[13] Z. Zhang. Parameter estimation techniques: A tutorial with application to conic fitting. Image and Vision Computing, 15(1):59–76, 1997.