Univerza v Ljubljani
Fakulteta za elektrotehniko

Matej Kristan

Sledenje ljudi v video posnetkih s pomočjo verjetnostnih modelov

DOKTORSKA DISERTACIJA

Mentor: prof. dr. Stanislav Kovačič
Somentor: prof. dr. Aleš Leonardis

Ljubljana, 2008

University of Ljubljana
Faculty of Electrical Engineering

Matej Kristan

Tracking people in video data using probabilistic models

Ph.D. Thesis

Supervisor: prof. Stanislav Kovačič, Ph.D.
Co-supervisor: prof. Aleš Leonardis, Ph.D.

Ljubljana, 2008

Abstract

In this thesis we focus on probabilistic models for tracking persons in visual data. Tracking is defined within the context of probabilistic estimation, where the parameters of the target's model are considered random variables and the aim is to estimate, recursively in time, the posterior probability density function over these parameters. The recursive estimation is approached within the established Bayesian framework of particle filtering. Several aspects of tracking persons are considered in this thesis: how to build a reliable visual model of a person, how to efficiently model the person's dynamics, and how to devise a scheme to track multiple persons.

One of the essential parts of visual tracking is the visual model, which allows us to evaluate whether a person is located at a given position in the image. We propose a color-based visual model, which improves tracking in situations when the background color is similar to the color of the tracked person. The proposed color-based visual model applies a novel measure which incorporates the model of the background to determine whether the tracked target is positioned at a given location in the image. A probabilistic model of the novel measure is derived, which allows the use of the color-based visual model with the particle filter. The visual model does not require a very accurate model of the background, but merely a reasonable approximation of it.
To increase robustness to the color of the background, a mask function is automatically generated by the visual model to mask out the pixels that are likely to belong to the background. A novel adaptation scheme is applied to adapt the visual model to the current appearance of the target. The experiments show that the proposed visual model can significantly improve tracking in situations when the color of the tracked person is similar to the background, and can handle short-term occlusions between persons of different colors. However, tracking still fails when a person comes into close proximity of a visually similar object or is occluded by it. The reason is that the ambiguity in the visual information is too large and cannot be resolved even with a good dynamic model.

To better cope with the visual ambiguities associated with the color of the tracked person, we propose a combined visual model, which fuses the color information with the local motion in the person's appearance. The local-motion feature is calculated from a sparse estimate of the optical flow, which is evaluated only at image locations with sufficient texture. A probabilistic model of the local motion is derived which accounts for the errors in the optical-flow estimation as well as for rapid changes in the target's motion. The local-motion model is probabilistically combined with the color-based model into a combined visual model under the assumption that color is conditionally independent of motion. An approach is also developed to allow adaptation of the local-motion model to the target's motion.

To better describe the dynamics of a moving person and improve the estimation and prediction of the person's position, we propose a novel dynamic model, which we call the two-stage dynamic model, and the corresponding two-stage probabilistic tracker. The two-stage dynamic model is composed of a liberal and a conservative dynamic model.
The liberal model allows larger perturbations in the target's dynamics and is used within the particle filter to efficiently explore the state space of the target's parameters. This model is derived by modelling the target's velocity with a non-zero-mean Gauss-Markov process and can describe well motions ranging from a complete random walk to a nearly constant velocity. The conservative model imposes stronger restrictions on the target's velocity and is used to estimate the mean value of the Gauss-Markov process in the liberal model, as well as to regularize the state estimated by the particle filter. We give a detailed analysis of the parameters of the two-stage dynamic model, and also derive an approach to setting the spectral density of the liberal model.

The proposed solutions for tracking a single person are extended to tracking multiple persons. A context-based scheme for tracking multiple targets from a bird's-eye view is proposed, which simplifies the Bayes recursive filter for multiple targets and allows tracking with a lower computational complexity. When the scene is observed from a bird's-eye view, the recorded images can be partitioned into regions such that each region contains only a single target. This means that, given a known partitioning, the Bayes filter for tracking multiple targets can be simplified into multiple single-target trackers, each confined to the corresponding partition of the image. A parametric model of the partitions is developed, which requires specifying only the locations of the tracked targets. Since the partitions are not known prior to a given tracking iteration, a scheme is derived which iterates between estimating the targets' positions and refining the partitions. Using this scheme, we simultaneously estimate the locations of the targets in the image as well as the unknown partitioning.
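The partition model above requires specifying only the targets' locations. As an illustration only, a minimal sketch of such a location-driven partitioning, assuming a simple nearest-target (Voronoi) rule rather than the thesis's exact parametric form, might look like:

```python
import numpy as np

def assign_partition(point, target_positions):
    """Return the index of the region containing `point`, using a
    nearest-target (Voronoi) rule: each image location belongs to the
    partition of the closest tracked target."""
    diffs = np.asarray(target_positions, dtype=float) - np.asarray(point, dtype=float)
    distances = np.linalg.norm(diffs, axis=1)
    return int(np.argmin(distances))
```

Each single-target tracker would then evaluate its likelihood only inside its own partition; in the scheme proposed in the thesis, the partitions themselves are re-estimated jointly with the targets' positions.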
Key words: Computer vision; Probabilistic models; Tracking persons; Video data; Bayes recursive filter; Particle filters; Color-based model; Local motion; Two-stage dynamic model; Multiple targets

Acknowledgements

First of all I would like to express my sincere thanks to my supervisor, prof. dr. Stanislav Kovačič, and my co-supervisor, prof. dr. Aleš Leonardis, who have guided me during my studies and have provided a vibrant environment for my research. A big thank you to Janez Perš for his guidance in the early stages of my postgraduate studies and for discussions which have broadened my horizons in the field of computer vision. To Igor, Tanja and Katja, thanks for the encouragement, belief, and everything else that comes with a warm, supportive family. Thank you, Urša, for sticking with me from my diploma thesis, through my master's, right up to the doctoral thesis. You are the main reason why the Slovenian parts of those theses appear to be written in proper Slovene and with proper punctuation. I also thank my colleagues at the Machine Vision Group, the Visual Cognitive Systems Group, and the Laboratory of Metrology and Quality, many of whom have offered interesting discussions and comments on my professional work and theirs. I especially thank Miha Hiti and Barry Ridge for proofreading parts of my thesis. Thanks also to all my friends who have been great and relaxing company, and who are too numerous to be listed here – you know who you are. I do not, however, thank MAXDATA, who installed a shady disk in my laptop which almost caused me to lose my thesis – and I thank Norton Ghost for never having to think about that dreadful event again. And I thank Igor for helping me out the last time the damn laptop crashed.

Matej Kristan
Ljubljana, May 2008

Contents

Povzetek xi
  Description of the research area xii
  Visual features xiv
  Dynamic models xviii
  Methods for maintaining the identities of multiple targets xix
  Original scientific contributions xxi
  Detailed overview of the thesis xxii
1 Introduction 1
  1.1 Related work 3
    1.1.1 Visual cues 3
    1.1.2 Dynamic models 9
    1.1.3 Managing multiple targets 11
  1.2 Contributions 12
  1.3 Thesis outline and summary 14
2 Recursive Bayesian filtering 17
  2.1 Tracking as stochastic estimation 18
  2.2 Recursive solution 19
  2.3 Bayes filter for a stochastic dynamic system 21
  2.4 Historical approaches to recursive filtering 23
  2.5 Monte-Carlo-based recursive filtering 26
    2.5.1 Perfect Monte Carlo sampling 26
    2.5.2 Importance sampling 28
    2.5.3 Sequential importance sampling 29
    2.5.4 Degeneracy of the SIS algorithm 31
    2.5.5 Particle filters 34
3 Color-based tracking 41
  3.1 Color histograms 42
    3.1.1 Color-based measure of presence 43
  3.2 The likelihood function 45
  3.3 The background mask function 47
    3.3.1 Implementation of dynamic threshold estimation 48
  3.4 Adaptation of the visual model 49
  3.5 Color-based probabilistic tracker 52
  3.6 Experiments 54
  3.7 Conclusion 58
4 Combined visual model 61
  4.1 Optical flow 62
    4.1.1 Calculating the sparse optical flow 65
  4.2 Optical-flow-based local-motion feature 68
    4.2.1 Local-motion likelihood 68
    4.2.2 Adaptation of the local-motion feature 69
  4.3 The combined probabilistic visual model 70
  4.4 Experiments 70
  4.5 Conclusion 76
5 A two-stage dynamic model 79
  5.1 The liberal dynamic model 80
    5.1.1 Parameter β 83
    5.1.2 Selecting the spectral density 86
  5.2 The conservative dynamic model 87
  5.3 A two-stage dynamic model 89
  5.4 Experimental study 92
    5.4.1 Experiment 1: Tracking entire persons 92
    5.4.2 Experiment 2: Tracking a person's hands 99
  5.5 Conclusion 101
6 Tracking multiple interacting targets 103
  6.1 Using the physical context 104
  6.2 Parametric model of partitions 105
  6.3 Context-based multiple target tracking 106
  6.4 Experimental study 108
    6.4.1 Description of the recordings 108
    6.4.2 Results 111
  6.5 Conclusion 116
7 Conclusion 119
  7.1 Summary of contributions 122
  7.2 Future work 123
References 125
Appendix A 141
  A.1 Selection of the likelihood function 142
Appendix B 145
  B.1 Random-walk dynamic model 147
  B.2 Nearly-constant-velocity dynamic model 147
Biography 149
Published work 153
Izjava 157

Povzetek

Sledenje ljudi v video posnetkih s pomočjo verjetnostnih modelov
(Tracking people in video data using probabilistic models)

In this thesis we deal with probabilistic models for tracking persons in video data. We treat the parameters of the person's model as random variables and define tracking in the context of statistical estimation. The problem posed in this way is then solved by recursively estimating, over time, the posterior probability density function over the values of the parameters.
We approach the recursive estimation with the established probabilistic Bayesian methods known as particle filters. When tracking with particle filters, the success of tracking strongly depends on three major components of the tracker. The first component is the probabilistic visual model, which localizes the target in the image by means of its visual properties. The second component is the probabilistic dynamic model, which describes the target's dynamics; this model determines how the parameters of the target's model evolve through time. The third component of the tracker is the method for maintaining the identities of multiple targets, which is particularly important when the targets collide. In this thesis we propose improvements of all three components of the tracker. In the following we first describe the research area, then state the original scientific contributions, and conclude with a detailed overview of the thesis with emphasis on the contributions.

Description of the research area

Tracking people in video data is part of the broader field of computer vision, which many researchers have addressed over the last twenty years. The result of this research is a large body of literature, surveys of which can be found in the works of authors such as Aggarwal and Cai [2], Gavrila [51], Gabriel et al. [49], Hu et al. [60] and Moeslund et al. [114, 115]. Tracking with computer-vision methods has found its place in many applications, among them:

• Video surveillance, where the aim is to track cars and people in order to detect unusual behavior.
• Video editing, where the aim is to overlay graphical content on moving objects (persons) in video.
• Analysis of sports games based on trajectories obtained by tracking the players during a match.
• Tracking laboratory animals such as insects and rodents, where the goal is to study natural multi-agent systems.
• Human-machine interfaces in intelligent environments that assist people in their everyday tasks.
• Cognitive systems that use tracking to learn about the dynamic properties of the observed objects.

The principal problem of tracking in video is the uncertainty associated with the visual information and the uncertainty in the dynamics of the tracked objects. A natural way to account for these uncertainties is to treat tracking in the context of statistical estimation of the target's state (e.g., its position) through time. More precisely, the knowledge about the target's current state is represented as a probability density function (pdf) over the target's state space. Tracking is then treated as the problem of recursively estimating the target's posterior pdf at each time step in view of the current measurements. Under the assumption that the target's dynamics and the measurement process can be described by linear Gaussian processes, the posterior pdf can be computed analytically with the well-known Kalman filter [76]. The assumptions made by the Kalman filter, however, are often too idealized for visual tracking, resulting in degraded performance or even frequent failures of the tracker. To address more realistic problems, many improvements have been proposed in the literature, but these were not able to model the completely arbitrary distributions that can arise in visual tracking. In the late nineties, Isard and Blake [64] presented a method called the Condensation algorithm for efficient computation of the target's posterior pdf, which did not rely on assumptions as restrictive as those of the Kalman filter. This method belongs to the broader class of sequential Monte Carlo methods, collectively known as particle filters [6, 43]. In contrast to the Kalman filter, particle filters do not assume a Gaussian posterior pdf of the target, but represent the distribution by a discrete set of samples (particles). A tracking iteration thus consists of two steps. In the first step, the motion of the particles is simulated through a proposal distribution.
In the second step, each particle is assigned a weight based on the dynamic model and the likelihood function. The proposal distribution can serve to inject auxiliary information that steers the particles toward regions of the state space with higher probability. Often, however, no auxiliary information is available, and in such cases the dynamic model itself can be used as the proposal distribution. The result is the bootstrap particle filter [53], which is also the most widely used among all variants of these filters. The efficiency of a particle filter depends primarily on the following parts:

• The visual features used to describe the visual properties of the target.
• The dynamic model, which describes the dynamics of the target's motion.
• The scheme for maintaining the identities of the targets when more than one target is tracked.

This thesis focuses on the three parts listed above, which we address in the context of probabilistic tracking of persons in video data. The main contributions concern probabilistic models of visual features, probabilistic dynamic models, and probabilistic schemes for maintaining identities when tracking multiple targets. In the remainder of this chapter we first review the literature of the research area to which the contributions of the thesis pertain.

Visual features

Visual features model the visual information in images and are used during tracking to localize the tracked objects. With respect to the type of visual information, these models can be divided into shape models, appearance models and motion-based models.

Shape models

One of the early approaches to shape modelling was based on flexible curves, or snakes [148], which iteratively fit themselves to the edges of an object. The main drawback of these methods was their sensitivity to noise in the data. Snakes therefore often failed, e.g., when objects occluded each other or when an object was located on a background with many edges.
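Returning to the bootstrap particle filter described above: one full iteration (simulate through the proposal, weight by the likelihood, resample) can be sketched for a one-dimensional target. The random-walk proposal, the Gaussian likelihood and all parameter values below are illustrative assumptions, not the models used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_pf_step(particles, weights, z, motion_std=1.0, obs_std=0.5):
    """One iteration of a bootstrap particle filter for a 1-D target."""
    # 1) Proposal = dynamic model: simulate particle motion (random walk).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # 2) Weight each particle by the likelihood of the measurement z
    #    (Gaussian observation model assumed here).
    weights = weights * np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
    weights = weights / weights.sum()
    # 3) Resample to counter weight degeneracy; weights become uniform.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

Repeating this step while feeding in measurements drives the particle cloud toward the region of the state space supported by the observations.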
When dealing with objects whose shapes do not differ substantially, point distribution models (PDM) [37] can be used. These models have been successfully applied to modelling the shapes of resistors [36] as well as pedestrians [50]. PDMs assume that a unique set of points can be selected along the outline of the modelled objects. From a large set of objects annotated in this way, a compact description of the object can be obtained through principal component analysis (PCA). The final model thus consists of a subspace of points spanned by a small number of dominant directions of variation. Active shape models [19], based on B-splines with control points placed uniformly along the object's outline, have been used to determine the expected shape of a pedestrian in a visual-surveillance application [12], as well as to track a leaf on a bush [19]. Blake et al. [19] constrained the shape of the active contour to object-specific shapes by specifying a probability density function over the shape space. One of the important parameters of the active contour, which in general depends on the object, is the number of control points of the B-splines. Too many points can make the model overly complex and unstable, while modelling with too few points can result in an oversimplified shape that is unsuitable for tracking. To avoid the influence of the contour parameterization, Malladi et al. [108] proposed the use of level sets. The essence of level-set functions is that the explicit modelling of a curve is replaced by the modelling of a higher-dimensional level-set function; the contour is then given by this function at a constant level. One advantage of level-set functions over active contours is their ability to model topological changes in the object's shape. An example of tracking people with level-set functions can be found in [38].
In cases where the tracked objects are described by only a small number of pixels, or change shape rapidly, the approaches described above are not suitable for their representation. Perš and Kovačič [125] proposed 14 binary, Walsh-function-like kernels for a robust description of the shape of a handball player moving around the court. They used the kernels to encode the player at one time step and to localize the same player in the next step. Needham [116] proposed five pre-trained multi-resolution kernels for describing football players, determined from a large set of pre-segmented binary images of players. Dimitrijević et al. [42] used special motion-capture equipment to obtain a large set of sequences of human shapes during walking. From these sequences they determined templates for detecting key poses of walking people, where the key pose was defined as the pose in which a person has both feet on the ground and the angle between the legs is largest. They improved the robustness of the detection by considering several consecutive templates. Dalal and Triggs [40] presented an approach in which the shape of people is described with histograms of oriented gradients (HOG). They first divided the image into small cells and built a 1D histogram of gradient orientations for each cell. These gradients served as features for representing the content of an arbitrary rectangular region, and a support vector machine (SVM) was used to determine whether a rectangular region contains a person. Lu and Little [102] used HOGs for tracking and action detection of hockey players. They described the appearance of each player with its own HOG and used a particle filter to generate new possible positions of the players in the next image. The tracked players were located in the new image by comparing the reference HOGs with those computed at the generated positions.
Zhao and Thorpe [174] proposed using gradients computed from silhouettes obtained from depth images, and used a neural network to verify whether a silhouette belongs to a human. One of the weaknesses of shape-based models is that they do not account for the color of objects. These models therefore cannot track objects in the presence of other objects of the same class, even if the latter are of different colors. Another problem of models that explicitly describe the object's shape lies in their construction: considerable attention must be paid to capturing as much as possible of the variability of the class of shapes to be tracked. Moreover, capturing the shapes requires specialized software and hardware.

Appearance models

Early approaches to appearance-based tracking were based on so-called color templates [62, 142]. Color templates represent the tracked object with a rectangular matrix of pixels and a masking function that determines which pixels belong to the object and which do not. The template is computed at a known position of the object in one image and is used to localize the object in the next image. Senior [141] used a somewhat more complex adaptive statistical appearance model for tracking in visual surveillance. This approach is also based on representing the tracked object with a rectangular matrix of elements, the difference being that the color of each element is modelled with a Gaussian distribution; the masking function is computed in a similar manner. Lim et al. [100] model the appearance of people with rectangular regions and at the same time model the dynamics of appearance changes. They achieve this by projecting the pixels within the region into a low-dimensional subspace through the nonlinear locally-linear-embedding algorithm, in which they then learn a dynamic model of the appearance of a walking person. Jepson et al.
[68] address the problem of changing appearance by modelling the appearance with three components: a slowly varying, a rapidly varying and a noise component. During tracking, all three components are continually adapted to the current changes with the expectation-maximization (EM) algorithm. Utsumi and Tetsutani [158] use a priori knowledge about appearance to detect people in images. They divide the input image into small cells and compare the variances and mean values of intensity between nearby cells. The detection of people rests on the assumption that these values vary little between neighboring cells in images that contain people, and more in images without them. In a sports-tracking application, Ok et al. [119] assume that a player can be compactly described with two colors: the color of the shirt and the color of the shorts. The authors therefore divide the player into two regions and describe each region with its average color. Wren et al. [169] presented the Pfinder system, which segments a person into a set of blobs and describes each blob with an ellipse and an average color. However, this system works only under fairly controlled conditions and when only a single person is present in the scene. A more robust approach is to use specialized detectors for individual body parts [113, 131]. With the help of the known topology of the body, these detections can then be used to build a statistical model for detecting people in images. The weakness of these approaches is that real environments contain many limb-like structures, which greatly increases the unreliability of the detection. A frequently used approach to modelling color appearance are color histograms [153], which have often been used in visual-tracking applications [60, 123, 118, 162, 35, 120, 34, 128]. Birchfield and Rangarajan [14] proposed a class of color histograms that also contains spatial information about color, which they achieve by recording the spatial information about the colors of the individual bins of the histogram.
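As an illustration of the histogram-based appearance models above, the sketch below builds a normalized joint RGB histogram and compares two histograms with the Bhattacharyya coefficient, a similarity commonly used with such models. The function names and the bin count are illustrative, not the exact choices of any cited work:

```python
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalized joint RGB histogram of an (N, 3) array of values in [0, 255]."""
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1.0)

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms; 1.0 if identical."""
    return float(np.sum(np.sqrt(p * q)))
```

In tracking, the coefficient between the reference histogram and a candidate-region histogram is typically converted into a likelihood, e.g. through exp(-λ(1 − coefficient)).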
Another rather popular approach to appearance modelling is the use of parametric models such as mixtures of Gaussians (MoG) [112, 78, 80, 172]. Recently, Tuzel et al. [156] presented a covariance-based description of the appearance model. In their approach, each pixel in the rectangular region describing the object is represented by a set of features, which can be intensities, gradients, etc. The appearance model is then built through the covariance matrix of the features computed over all pixels of the object. Objects are detected by comparing the covariance in a given region with a reference covariance, using a distance based on generalized eigenvalues. The visual models described so far are mainly based on the shapes, colors or intensity gradients of the tracked object in the image. Since these models directly encode information about pixel intensities, they cannot discriminate well between visually similar objects when these move close to each other or occlude each other. A different approach is therefore to use a feature that does not directly describe the intensity information; such a feature is, for example, the motion of pixels.

Motion-based models

Sidenbladh and Black [144] presented a method that considers the responses of various filters and learns the statistics of motion and appearance from a large set of examples of body parts; they use this method to determine the human pose. An example of tracking based entirely on optical flow was presented by Du and Piater [44]. The authors use the Kanade-Lucas-Tomasi point tracker [103] within a particle filter and identify the targets in each image by clustering similar optical flows. A similar approach was used by Pundlik and Birchfield [130], who use an affine-consistency criterion for clustering the optical-flow vectors. Recently, Brostow and Cipolla [25] proposed a method that uses optical flow to extract stable point trajectories in an image sequence.
The trajectories are then clustered with the minimum-description-length (MDL) method, resulting in independently moving bodies. All the methods above use clustering of the optical flow to determine the objects in the images. As a consequence, they cannot maintain the correct identities of the objects when these occlude each other, even if the objects are of different colors.

Dynamic models

While visual models describe the visual properties of the tracked objects, dynamic models describe their motion. Knowledge about the dynamics of an object's motion can greatly reduce the space of possible values of the object's state parameters that have to be estimated during tracking. This can help resolve ambiguities in the visual data, and can reduce the computational complexity of tracking. Mainly for these reasons, dynamic models are often used in estimating the human pose during motion. Sidenbladh et al. [145] use a strong a priori model of walking to estimate the possible directions of motion of the tracked person; the motion model is learned from a large database of labelled examples. Agarwal and Triggs [1] use a set of second-order models for tracking the articulated motion of people during walking and running. Urtasun et al. [157] use the scaled-Gaussian-process latent-variable method with embedded dynamics to learn a low-dimensional embedding of the space of poses of a golfer during a swing and of the space of shapes of a walking person. Some authors have proposed using interacting multiple models (IMM) to describe different types of object-motion dynamics. This approach is based on running several trackers simultaneously, where each tracker uses a different dynamic model to track the same object. Through a special procedure that determines how well each of the models describes the current motion of the object, the tracking results of the individual trackers are combined into a joint estimate of the target's state [10].
IMM methods based on the Kalman filter have mainly been used in radar-based tracking of airplanes [98, 9]; an example of an application to camera-gaze control can be found in [23]. Because of the limitations of the Kalman filter, some authors [111, 20] have used IMM methods in combination with particle filters. A drawback of IMM methods is that the probability space increases considerably in comparison with methods that use only a single dynamic model, since the probability density function has to be estimated over all the models, not just one. In particle filters, the value of the likelihood function has to be computed for each generated hypothesis (particle) separately. In visual-tracking applications this is usually a time-consuming operation, since a visual model has to be built for each particle and compared with the reference model. The computational cost of visual tracking with particle filters thus increases substantially when IMM methods are used. In many applications (e.g., tracking in sports, visual human-machine interfaces for gesture recognition, face tracking and visual surveillance) it is difficult to specify a compact set of rules that govern the target's dynamics. For this reason, and because of the computational complexity of IMM methods, most researchers use only a single model to describe the dynamics. The classical choice is the random-walk (RW) model or the nearly-constant-velocity (NCV) model; a good description of these models can be found in [136]. The RW model best describes the motion of a target that abruptly changes the direction of its motion or remains still. When the target moves approximately uniformly in some direction (which is typical of tracking applications in sports and surveillance), the RW model gives poor results and the motion is better described by the NCV model. Thus, in order to cover a wider spectrum of motion, researchers usually choose one model, RW or NCV, and increase its process noise.
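For reference, the standard discretized forms of these two models, as commonly given in the tracking literature (e.g. [136]), can be written down directly. The sketch below is a generic textbook formulation, with dt the sampling period and q the process-noise intensity:

```python
import numpy as np

def ncv_model(dt=1.0, q=1.0):
    """Nearly-constant-velocity (NCV) model over the state [position, velocity]."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])                 # state-transition matrix
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                      [dt**2 / 2, dt]])        # process-noise covariance
    return F, Q

def rw_model(dt=1.0, q=1.0):
    """Random-walk (RW) model: position only, driven by white noise."""
    F = np.array([[1.0]])
    Q = np.array([[q * dt]])
    return F, Q
```

Increasing q in either model broadens the spread of the predicted particles, which is exactly the "increase the process noise" remedy described above.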
However, if we wish to achieve a sufficiently dense coverage of the probability space, and thereby satisfactory tracking, the number of particles in the particle filter has to be increased. This increases the number of required evaluations of the likelihood function, which slows down tracking.

Methods for maintaining the identities of multiple targets

When several targets are tracked simultaneously, the non-trivial problem arises of maintaining the correct identity of each target. The classical approach in estimation and control theory is to detect all possible target candidates and associate the detections with the tracked targets. The standard approaches to solving the association problem are nearest-neighbor (NN) association and joint probabilistic data association (JPDA) [56]. Applications of NN and JPDA filters to tracking in sports can be found in [171, 66, 30] and [77], and earlier examples of the use of JPDA filters in computer vision in [132, 138]. All these approaches rely on explicit detection of possible targets and require exhaustive enumeration of all possible associations between targets and detections, which leads to an NP-complete problem. Some authors therefore try to reduce the number of possible associations at each step by considering only the detections closest to each target (gating) [171, 66, 30]. Hue et al. [61] treat the association vector as a random variable whose current value is determined by sampling with a Gibbs sampler. A different approach to tracking multiple targets is to treat the states of the individual targets as a single joint state. Such an approach allows the use of existing solutions in the context of particle filters [123, 116]. Isard et al. [65] proposed extending the joint state with an additional random variable representing the number of observed targets, and demonstrated the procedure on tracking a varying number of targets. This approach was used by Czyz et al. [39] for tracking football players.
A drawback of the methods that use a joint state is that, as a rule, a poor estimate of even one of the targets corrupts the joint estimate of all the targets. The number of particles in the particle filter therefore has to be increased considerably, which slows down tracking significantly and makes such a tracker suitable for tracking only a small number of targets [81]. To address this problem, some authors [175, 81] have recently proposed more efficient schemes based on Markov Chain Monte Carlo (MCMC) methods. Vermaak et al. [162] formulated tracking of multiple visually similar targets as the problem of maintaining the modes of the posterior distribution over all the targets. This approach was later used by Okuma et al. [120] and Cai et al. [30] for tracking hockey players. When the number of targets is known, a simple solution is to track each target with its own tracker. Such an approach reduces the complexity of the problem, since estimating the state of one target does not require considering the states of all the others. However, this approach is rather naive, since it often happens that, after a collision or an occlusion between similar targets, several trackers end up tracking the same target and tracking fails [81]. To address this problem, various authors have proposed methods such as histogram back-projection [142], occlusion-alarm probabilities and template-based methods [34]. Nevertheless, these methods fail when the targets are visually similar and move close to each other.

Original scientific contributions

In this thesis we have developed probabilistic models for tracking people in video data. We have investigated various probabilistic models of the visual and dynamic properties of the targets, as well as approaches to tracking multiple targets, with the goal of proposing solutions that improve tracking without substantially increasing the processing time and thereby slowing tracking down.
The original scientific contributions are the following:

• We developed a color-based visual model which improves tracking when the color of the tracked object is similar to the color of the background. The proposed visual model uses a novel presence measure for detecting a person at a given position in the image, which takes the background model into account. We derived a novel probabilistic model of the presence measure, which allows its use within a particle filter. The visual model does not require a very accurate model of the background, but merely an approximate estimate of it. To increase robustness to the background color, the visual model generates a mask for discarding the pixels that are more likely to belong to the background. We also proposed a novel method for adapting the model to the current visual properties of the target.

• We proposed a combined visual model which fuses color information with local-motion features, thereby resolving occlusions between visually similar objects. The local-motion feature is computed from a sparse optical flow at points that contain enough texture. We derived a probabilistic model of local motion which accounts both for possible errors in the optical-flow estimate and for changes in the direction of the target's motion. We combined the local motion with the color model into a combined model under the assumption that the object's motion is conditionally independent of its color. We proposed an approach by which the local-motion model adapts to the target's motion during tracking.

• We proposed a two-stage dynamic model, which combines a liberal and a conservative model for more faithful modelling of the target's motion, together with a method for setting the parameters of the liberal model. The two-stage dynamic model is composed of two dynamic models: a liberal and a conservative one. The liberal model allows large changes in the target's dynamics and is used in the particle filter for efficient coverage of the state space of the target's parameters.
We derived the model by modelling the velocity with a Gauss-Markov process with a nonzero mean; the model is therefore able to describe well a range of different motions, from random walk all the way to nearly-constant velocity. The conservative model imposes stricter constraints on the target's velocity. Within the tracker, the conservative model estimates the mean value of the Gauss-Markov process of the liberal model and at the same time regularizes the estimate of the target's state obtained from the particle filter. We analyzed the parameters of the dynamic model and proposed a practical method for estimating the spectral density of the noise in the liberal model.

• We proposed a context-based method for tracking a larger number of targets with linear computational complexity. In the context of observing the scene from a bird's-eye view, the recorded image can be partitioned into regions such that each region contains only a single target. This means that, for a known partitioning, the multi-target Bayes filter simplifies into several single-target trackers, with each tracker restricted to its own region of the image. We proposed a parametric model of the regions which requires specifying only the positions of the tracked objects. Since the partitioning is not known prior to a tracking iteration, we developed a method that iterates between estimating the positions of the targets and refining the estimate of the partitioning. With this method we simultaneously estimate the positions of the targets in the image as well as the unknown partitioning of the image.

In the following we give a more detailed overview of the contents of the thesis, with emphasis on the contributions to science.

Detailed overview of the contents

In Chapter 2 we described in detail the probabilistic framework of particle filters, within which we addressed the problem of tracking. We first posed tracking as a problem of stochastic estimation and then presented the well-known conceptual solution, which is obtained by applying the Bayesian recursive filter.
After a short review of historical approaches to recursive filtering, we showed how the recursion can be solved with Monte Carlo methods, the result being particle filters.

Chapter 3 is devoted to the development of a color-based visual model of the target, which is one of the principal parts of a tracker, since it allows estimating whether the target is located at a given position in the image. We derived the color-based visual model from color histograms and proposed improvements related to tracking with color information. The first improvement was a color-based presence measure. The proposed presence measure uses an estimate of the background image to reduce the influence of background noise¹. Using model selection and the maximum-likelihood method, we derived a likelihood function that allows a probabilistic interpretation of the values of the similarity measure, which enables integration into the probabilistic tracking framework. A problem arises when the target moves across a background of similar color, since in such extreme cases the presence measure does not discriminate well enough between the background and the tracked object. The visual model therefore attempts to estimate a mask for discarding the pixels that do not belong to the target. When the illumination of the scene changes, or when the camera shakes, it is usually difficult to obtain an accurate model of the background. We therefore focused on using only a simple model and proposed a procedure for dynamic background subtraction. In our approach the mask is generated indirectly, through an estimate of the similarity between the tracked object and the background, and is in this sense individualized to the tracked object. A further improvement is a method for selective adaptation of the visual model, which prevents adaptation in cases when the tracked object is occluded or when the estimate of its position in the image is wrong. We proposed an approach for probabilistically combining all these improvements into a tracker based on a particle filter.
The experimental results showed that the proposed solutions greatly improve tracking in cases when the tracked object is similar to the background and when short-term occlusions between visually similar objects occur. However, the experiments also showed that tracking still fails when the tracked object approaches, or is occluded by, an object of similar color.

In Chapter 4 we propose extending the color model with an additional model, which we call the local-motion model, into a new, combined visual model. The local-motion feature is computed from the optical flow, which we estimate with the Lucas-Kanade algorithm. While the Lucas-Kanade algorithm is relatively fast, it estimates the optical flow poorly at points where the image contains little texture. We therefore first use Shi-Tomasi features to determine regions with sufficient texture and compute the optical flow only at those points. The local-motion feature is thus determined using only a sparse representation of the optical flow in the image. To account for possible errors in the optical-flow estimate and for changes in the target's motion, we developed a probabilistic model of local motion. Since the local-motion model changes considerably as the target moves, we developed a method for adapting the model, which takes into account the estimated velocity of the tracked object. Using probabilistic approaches, we combined the local-motion model with the color model into a combined visual model of the target. We proposed a probabilistic tracker based on particle filters, which uses the combined visual model for tracking. The proposed tracker was tested on examples of tracking a palm and tracking people in surveillance and sports. The experimental results showed that the combined model successfully resolves occlusions between visually similar objects and enables improved tracking.

¹By the phrase "background noise" we refer to pixels that are similar in color to the pixels belonging to the tracked object.
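The core of such a sparse-flow computation — estimating the displacement at a single textured point by solving the 2×2 Lucas-Kanade normal equations — can be sketched as follows (a minimal illustration on a synthetic image pair, not the thesis implementation; the helper name `lucas_kanade_flow` is ours):

```python
import numpy as np

def lucas_kanade_flow(prev, curr, x, y, win=7):
    """Estimate the flow (u, v) at pixel (x, y) from the 2x2 LK normal equations."""
    h = win // 2
    p = prev[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    c = curr[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    Iy, Ix = np.gradient(p)              # spatial gradients of the previous frame
    It = c - p                           # temporal derivative between the frames
    # accumulate the structure tensor and the right-hand side over the window
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)         # least-squares flow (u, v)

# synthetic pair: the second frame is the first shifted by (u, v) = (0.4, 0.2)
xs, ys = np.meshgrid(np.arange(64, dtype=float), np.arange(64, dtype=float))
f = lambda X, Y: np.sin(0.3 * X) + np.cos(0.25 * Y)
prev_img = f(xs, ys)
curr_img = f(xs - 0.4, ys - 0.2)
u, v = lucas_kanade_flow(prev_img, curr_img, 32, 32)
```

In practice the window is placed only at points passing a Shi-Tomasi-style texture test, precisely because the matrix `A` becomes ill-conditioned in textureless regions.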
In Chapter 5 we focused on another very important component of a probabilistic tracker – the target's dynamic model. We proposed a two-stage dynamic model and a two-stage tracker, which can account for different types of motion dynamics. The two-stage model is composed of two dynamic models: a liberal and a conservative one. We derived the liberal dynamic model from the assumption that the object's velocity can be modelled by a Gauss-Markov process with a time-varying mean. An analysis of the parameters of the liberal model showed that two popular dynamic models, the random-walk (RW) model and the nearly-constant-velocity (NCV) model, are merely special cases of the liberal model, arising at the limiting values of its parameters. By choosing parameter values between these limits, the liberal dynamic model can describe well the dynamics that lie between RW and NCV. A very important parameter of the liberal dynamic model is the spectral density of the noise in the Gauss-Markov process, which depends on the dynamics characteristic of the class of tracked objects. We therefore proposed a method for the practical determination of the spectral density, which requires knowledge of only the general properties of the object's motion. The other important parameter of the liberal model is the mean value of the Gauss-Markov process, since it allows further adaptation of the tracker to the target's dynamics. To efficiently estimate this value during tracking, the two-stage tracker uses the conservative dynamic model. In contrast to the liberal model, the conservative model assumes that the object's current velocity is merely a linear combination of past velocities, and thus imposes stronger constraints on the object's velocity. The proposed two-stage dynamic model uses the liberal model within the particle filter for efficient exploration of the state space of the target's parameters.
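In a generic continuous-time form, such a velocity process can be written as a first-order mean-reverting Gauss-Markov model (a standard formulation given here only for illustration; the exact parametrization used in the thesis is developed in Chapter 5):

```latex
\dot{v}(t) = -\beta \big( v(t) - \bar{v}(t) \big) + w(t), \qquad
w(t) \sim \mathcal{N}(0,\, q),
```

where \(\bar{v}(t)\) is the (possibly time-varying) process mean, \(\beta\) controls how strongly the velocity is pulled toward that mean, and \(q\) is the spectral density of the driving noise. Loosely speaking, the limiting values of \(\beta\) recover random-walk-like and nearly-constant-velocity-like behavior, which is consistent with the analysis summarized above.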
On the other hand, the two-stage model uses the conservative dynamic model to estimate the mean value of the Gauss-Markov process in the liberal dynamic model and to regularize the estimates obtained from the particle filter. The experimental results showed that, in comparison with the two popular and commonly used dynamic models, the two-stage model achieves more accurate state estimates with a smaller number of particles in the particle filter. This considerably reduces the time required to process a single tracking iteration.

In Chapter 6 we extended the presented single-target solutions to tracking multiple targets. We focused on applications where the camera is positioned so that it observes the scene from a bird's-eye view, and proposed a new, context-based method for tracking multiple targets. In the context of observing the scene from a bird's-eye view, we derived constraints that simplify the multi-target tracking problem. These constraints dictate that the observed scene can be partitioned into a set of non-overlapping regions such that each region contains only a single target. We formalized the constraints with a parametric model of the image partitioning. In the Bayesian sense, the parametric model acts as a latent variable which, at a known value, simplifies the multi-target Bayes filter and allows each target to be tracked with its own tracker. This greatly reduces the computational complexity of the multi-target tracking problem. The two-stage dynamic model presented in Chapter 5 is used in the particle filter of each target, which makes the tracker even more efficient in terms of the time spent processing a single tracking iteration. The proposed context-based tracker was tested on a challenging database containing recordings of basketball and handball matches. The tracker was compared with a reference tracker, which differed from the proposed one only in that it did not use the image-partitioning model (the context) and was merely a set of independent single-target trackers.
In all the experiments the proposed tracker greatly reduced the number of tracking failures in comparison with the reference tracker, and enabled tracking even in cases when collisions and pushing occurred among several players.

The results and contributions of the thesis are summarized again in Chapter 7, where we emphasize the advantages and drawbacks of the proposed solutions. In light of these, we outline directions for further work and possible improvements of the methods for tracking people.

Although tracking itself is by and large a solved problem, ...

Jianbo Shi and Carlo Tomasi, 1994

Chapter 1

Introduction

Tracking people in video data is part of a broad domain of computer vision that has received a great deal of attention from researchers over the last twenty years. This gave rise to a body of literature, of which surveys can be found in the work of Aggarwal and Cai [2], Gavrila [51], Gabriel et al. [49], Hu et al. [60] and Moeslund et al. [114, 115]. Computer-vision-based tracking has found its place in many real-world applications; among these are:

• Visual surveillance, where the aim is to track people or traffic and detect unusual behaviors.

• Video editing, where the aim is to add graphic content over a moving object or a person in a video recording.

• Analysis of sport events to extract positional data of athletes during a part of the sports match. These data can then be used by sports experts to analyze the performance of the athletes.

• Tracking of laboratory animals such as insects and rodents, with the aim of studying the interactions of natural multi-agent systems.

• Human-computer interfaces used in intelligent ambients, which aim to assist people in their everyday tasks.

• Cognitive systems, which can use tracking to learn about the dynamic properties of different objects in their environment.

A prominent problem in tracking from video is posed by the inherent uncertainties associated with the visual data and with the dynamics of the tracked objects.
One way to account for these uncertainties is to consider the problem of tracking in the context of statistical estimation of the target's state (e.g., position) over time. More precisely, the information about the current state of the target is represented as a probability density function in the target's state space. Tracking is then posed as a problem of recursive estimation of the target's posterior distribution at each time-step in light of new measurements. Under the assumption that the target's dynamics and the measurement process can be described by linear Gaussian processes, the posterior can be calculated in closed form through the well-known Kalman filter [76]. However, the assumptions made by the Kalman filter are usually too unrealistic for visual tracking and thus result in degraded performance. Various extensions have been proposed over the years to account for more realistic models; however, none of them could deal with arbitrary forms of the target's posterior. In the late 1990s Isard and Blake [64] presented the Condensation algorithm, a method for efficiently calculating the posterior of the target that does not require the restrictions imposed by the Kalman filter. This method came from a general class of sequential Monte Carlo methods known as particle filters [6, 43]. In contrast to the Kalman filter, particle filters do not assume a Gaussian form of the target's posterior, but rather represent distributions by weighted sets of samples (particles). Each sample represents a realization of the target's state, and tracking then proceeds in two steps: simulating the samples using a proposal distribution, and recalculating their weights using the target's dynamic model and a likelihood function, which tells how likely each simulated state is, given the observation. The proposal distribution can serve as a means of using auxiliary information to guide particles into the more probable regions of the state space.
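The two steps above can be sketched as a minimal particle filter for a one-dimensional state, using the dynamic model itself as the proposal and a toy Gaussian likelihood (illustrative only; the names, noise levels and random-walk dynamics are our assumptions, not a model from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, measurement, q=0.5, r=1.0):
    """One iteration: propagate with the dynamic model (proposal), reweight, resample."""
    # 1) simulate each particle with the dynamic model (here: random walk, noise q)
    particles = particles + rng.normal(0.0, q, size=particles.shape)
    # 2) reweight by the likelihood of the measurement (here: Gaussian with std r)
    weights = weights * np.exp(-0.5 * ((measurement - particles) / r) ** 2)
    weights /= weights.sum()
    # 3) resample to counter weight degeneracy (simple multinomial resampling)
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# toy run: a target drifting to the right, observed through noisy position measurements
truth = np.cumsum(np.full(30, 0.3))
measurements = truth + rng.normal(0.0, 1.0, size=30)
particles = rng.normal(0.0, 2.0, size=500)
weights = np.full(500, 1.0 / 500)
for z in measurements:
    particles, weights = particle_filter_step(particles, weights, z)
estimate = np.average(particles, weights=weights)
```

The point estimate is taken here as the weighted mean of the particles; in visual tracking the expensive part is step 2, where each particle requires building and comparing a visual model.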
When no such information is available, a common approach is to use the dynamic model as the proposal, which gives the widely-used bootstrap particle filter [53]. The efficiency of visual tracking with particle filters depends a great deal on the following subparts of the method:

• Visual cues, which are used to encode the visual properties of the tracked objects.

• A dynamic model, which describes the dynamics of the tracked object.

• A multiple-target management system for keeping track of the identities of multiple objects in cases when multiple targets are considered.

The three subparts listed above will be the focus of this thesis. We will consider them in the context of probabilistic tracking of persons in video data. The main contributions will concern probabilistic visual models, probabilistic dynamic models and probabilistic schemes for tracking multiple targets. The remainder of this section is structured as follows. In Section 1.1 we review the related work on visual cues, dynamic models and probabilistic approaches to tracking multiple targets. In Section 1.2 we give a detailed description of our contributions, and in Section 1.3 we give the thesis outline.

1.1 Related work

1.1.1 Visual cues

The visual cues incorporate the visual information which is extracted from the images and is used to encode the visual properties of the tracked objects. Based on the type of the visual information contained in these models, we can divide them into three classes: shape-based models, appearance-based models and motion-based models.

Shape models

The early approaches to modelling shape used deformable lines, or snakes [148], which were iteratively fitted to the features corresponding to the edges of the object. The main disadvantage of these methods was their sensitivity to noise and their inability to handle situations where the object was occluded by another object.
Furthermore, those models were sensitive to the presence of spurious edges in the background. When we consider a class of objects with similar shapes, contour models such as point distribution models (PDM) [37] can be used. These have been successfully applied to modelling shapes of objects such as resistors [36] and have been demonstrated on an example of tracking pedestrians [50]. The PDMs are built from sets of examples of labelled points on the boundary of the object to be identified. A compact representation of the object is found through principal component analysis (PCA), by retaining a low-dimensional subspace spanned by the dominant modes of variation. Active shape models [19] based on B-splines with equally-spaced control points around the object's outline have been used to capture the expected shapes of pedestrians for visual surveillance in [12] and for tracking leaves of bushes [19]. The shape space of active contours can be efficiently constrained to a set of plausible shapes by building a probability density function (pdf) over the parameters of the contour [19]. To avoid a specific parametrization of the object's contour, level sets [108] have been proposed. Level sets are based on translating the explicit modelling of the curve into modelling a higher-dimensional embedding function. A constraint is imposed on this embedding function to yield regions inside and outside of the shape/contour. One advantage of level sets over active contours is that the embedding function can handle well topological changes in shape, such as splitting and merging. An example of using level sets for tracking silhouettes of humans in noisy images can be found in [38]. In cases when the tracked objects are small or change their shape rapidly, alternative shape features may be more appropriate. In an application of tracking in sports, Perš and Kovačič [125] encoded the players' shapes by utilizing 14 binary Walsh-function-like kernels.
The kernels were used to encode the shape of the target in the current time-step, and used in the next time-step to yield the most likely position of that target. To capture the variability in shape of football players, Needham [116] encoded the shapes of the players using a set of five pre-learned multi-resolution kernels, which were learned in a semi-supervised manner from hand-labelled binary images of the players. When the tracked objects occupy larger areas in the image, more detailed shape models can be applied. Dimitrijevic et al. [42] used motion-capture data to extract a large database of sequences of human shapes. These were used to detect key poses of walking humans in images by chamfer matching [121]. They defined the key pose as the pose when both feet of a person were on the ground and the angle between the legs was greatest. However, chamfer matching typically yields many false detections in real-life environments (e.g., [96, 42]). For that reason, the authors apply a temporal constraint by comparing three sequential frames with three sequential silhouettes in the template, and apply a statistical-relevance method to determine which parts of the silhouette are most significant for the task of detection. This methodology was extended in [48] to interpolate between detections and thus create trajectories of walking people. An implicit shape model was proposed by Leibe et al. [95] to detect pedestrians walking parallel to the image plane of the camera. Their approach uses a pre-learned codebook of patches extracted from pedestrians and applies a probabilistic Hough voting procedure. At the learning stage, patches are sampled from a set of pre-segmented images of pedestrians and a codebook of patches is generated. Then the extracted patches are revisited to create a spatial occurrence distribution for the codebook; at that stage the figure-ground map is also recorded for each patch.
In the recognition stage, candidate patches are extracted, matched to the codebook, and a spatial probability distribution of object locations is created. Detection is then carried out simply by detecting the modes in the location distribution. Dalal and Triggs [40] represented human shape by histograms of oriented gradients (HOG). They first divided an image into smaller cells, and for each cell a one-dimensional histogram of gradient directions was constructed. A support vector machine (SVM) was then used with these features to detect humans in rectangular regions in the image. Using a boosting approach, Zhou et al. [177] were able to speed up HOG-based detection to nearly real-time. Lu and Little [102] adopted HOGs to track and detect actions of hockey players. The key difference was that they used a separate reference HOG model of each player and used a particle filter to generate a set of hypothesized locations of the players in a given time-step. HOGs were extracted from the image at these hypothesized locations and then probabilistically compared to the reference HOGs to refine the hypotheses. Hotta [59] applied a bank of Gabor filters to detect edges in images and used the filtered images to detect and track faces. A face was encoded by dividing a predefined rectangular region into nonoverlapping blocks, and an SVM classifier was trained on each block separately using a database of presegmented faces. During the detection stage, the responses of these local classifiers were combined to classify the region as a face or a non-face. Zhao and Thorpe [174] calculated gradients from silhouettes of objects extracted from depth data. A neural network was then used on the calculated gradients to verify whether a given silhouette originated from a human. One drawback of the shape-based visual models is that they do not take into account the color properties of the target.
Thus these models can fail to maintain the identity of the object in the presence of multiple other objects of the same shape class. When constructing models that explicitly model the object's outline, great care must be taken to capture the variability of the entire class of objects we want to track. Furthermore, the construction of these models may require specialized hardware.

Appearance models

The early approaches in color-based tracking [62, 142] utilized color templates, which were extracted at the estimated position of the target in one frame and used to localize the same target in the next frame. A more elaborate adaptive statistical model of the object's appearance was used by Senior [141] in an application of visual surveillance. Each object was represented by a rectangular array of pixels, and the color distribution of each pixel was then modelled by a single Gaussian. Along with that, a mask function was estimated online to determine which pixels in the rectangle correspond to the object and which do not. Lim et al. [100] also encoded humans by regions within rectangles and modelled the dynamics of the changing appearance. This was achieved by projecting the pixels inside a rectangle to a low-dimensional subspace using a nonlinear local-linear-embedding algorithm. The dynamics of the appearance of a walking human were learned in this subspace. Jepson et al. [68] tackle the problem of appearance changes by modelling the appearance by three components: a slowly changing, a rapidly changing and a noise component. They use the expectation-maximization (EM) algorithm to update the components. Utsumi and Tetsutani [158] used prior knowledge of appearance to detect humans in images. They partitioned the image into a number of cells and compared variances and mean values of intensities among proximal cells.
Detection of humans was based on the assumption that, for images with humans, the distances among the cells will be smaller than for images without humans. In an application of sports tracking, Ok et al. [119] noted that a player can usually be described by two colors: the color of the shirt and the color of the shorts. Therefore, they divided each player into two separate regions and encoded each region by the mean value of the color within that region. Wren et al. [169] presented a system called Pfinder, which was based on segmenting a human into a set of blobs and encoding each blob by an ellipse and its color. This approach, however, works only when a single person is in the scene, and requires a controlled environment. A more robust approach is to apply body-part detectors to identify locations of the body parts, which can then be combined probabilistically to detect people [113, 131]. A drawback of this approach is that its performance can deteriorate in real-world images, since they usually contain many limb-like objects. An often-used approach to modelling color-based appearance is the application of color histograms [153]. Color histograms have been successfully applied in many applications of visual tracking [60, 123, 118, 162, 35, 120, 34, 128, 109, 7]. A common approach is to use a single histogram (e.g., [123, 118, 162]) to encode the object's appearance. Comaniciu and Meer [35] attempted to increase the robustness of tracking by also considering a histogram from a neighborhood of the tracked object to determine the salient components of the object's appearance model. A similar approach was adopted in [7] to determine color-salient regions on the object's appearance. Some attempts to explicitly include spatial information into histograms were presented, e.g., in [120, 109], where separate histograms were used to encode the upper and lower parts of a person's appearance.
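As an illustration of histogram-based appearance matching, the sketch below builds a normalized joint RGB histogram and compares two regions with the Bhattacharyya coefficient, a similarity measure commonly used with color-histogram trackers (a minimal example; the bin count and helper names are our choices, not a specific method from the works cited above):

```python
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalized joint RGB histogram of an (N, 3) array of pixel values in [0, 256)."""
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: 1 for identical histograms, 0 for disjoint ones."""
    return np.sum(np.sqrt(p * q))

rng = np.random.default_rng(1)
target_pixels = rng.integers(0, 256, size=(400, 3))   # pixels of the reference region
ref_hist = color_histogram(target_pixels)
other_hist = color_histogram(rng.integers(0, 128, size=(400, 3)))  # darker region

sim_self = bhattacharyya(ref_hist, ref_hist)    # identical regions -> coefficient 1
sim_other = bhattacharyya(ref_hist, other_hist) # different region -> lower score
```

A tracker typically converts such a similarity into a likelihood (e.g., through an exponential of the Bhattacharyya distance) so that it can weight particles.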
Birchfield and Rangarajan [14] proposed a class of color histograms that implicitly integrates the spatial information of the target's color. This is done by keeping track of spatial statistics for the colors of each bin in the color histogram. Another popular approach to modelling the appearance is using parametric models such as mixtures of Gaussians (MoG) to model the entire color distribution [112, 78, 80, 172] or to approximate only the dominant colors in the distribution [55]. Wang et al. [167, 168] extend MoGs by also considering spatial information and call these extended mixture models SMoGs. They also propose an EM-based algorithm to update SMoGs online. Recently, Tuzel et al. [156] introduced covariance-based descriptors of appearance. In their approach, each pixel in a rectangular region containing the object of interest is represented by a set of features. These features may be intensity values of color channels, gradients, etc. An appearance model is obtained by calculating the covariance matrix of the features over all pixels in the rectangle. This reference covariance is compared to the covariance in the candidate region by using a generalized eigenvalue-based distance measure. Babu et al. [7] proposed an appearance model for tracking nonrigid objects that can be considered a combination of a color-template- and a color-histogram-based approach. The model is constructed by selecting small neighborhoods of pixels within the object's bounding box. These neighborhoods are encoded by color templates as well as color histograms. During tracking, the color templates are used to obtain a rough estimate of the object's position in the current frame, and the histograms are then used to refine the position. Many of the visual models described so far are based on encoding some shape, color or gradient visual properties of the tracked objects.
When a target is moving in clutter, a single visual model may not be sufficient to discriminate the target from the background. For that reason several authors have proposed to track with combinations of these models. Li and Chaumette [97] combine shape, color, structure and edge information to improve tracking through varying lighting conditions and cluttered background. Similarly, Stenger et al. [152] and Wang et al. [168] combine color and edge features to make tracking robust to background clutter. Pérez et al. [124] propose to integrate sound cues with the visual cues to improve head tracking for specialized applications. Since all visual cues may not describe the target's appearance equally well, Brasnett et al. [24] proposed a weighted scheme to combine edge, color and texture cues. Even though fusing several visual models may improve tracking, these models are still intensity-related and are prone to fail in situations when the target is located in close proximity of another visually similar object. Thus, another approach is to utilize an appearance-independent cue such as the motion of pixels.

Motion-based models

Sidenbladh and Black [144] use filter responses to learn statistics of motion and appearance from a large number of training examples of different body parts for human pose estimation. Viola and Jones [163] improved pedestrian detection by learning a cascade of weak classifiers on manually extracted patches of differences between consecutive images. A probabilistic model of local differences in consecutive images was proposed by Pérez et al. [124]. They partition the image into an array of cells and assume that a cell contains motion if the differences in that cell are approximately uniformly distributed. A Parzen estimator [140] is then applied to produce a motion-based importance function, which is used within a particle filter to guide particles into the regions of the image which contain motion.
A drawback of methods which rely on image differencing is that they are essentially local-change detectors and therefore cannot resolve situations when a target is occluded by a moving, visually similar, object. An obvious solution is thus to take into account the apparent motion in the images – the optical flow. Various bottom-up approaches have been proposed recently, which are based on clustering similar flows to yield moving objects. Gonzalez et al. [52] applied a Kanade-Lucas-Tomasi (KLT) feature tracker [103] which used optical flow to track and cluster points on a human body. The robustness of the tracker was increased by applying a radial-basis-function network to filter the optical flow. Another attempt to track solely by the optical flow was presented by Du and Piater [44]. In their approach a KLT feature tracker was implemented in the context of a mixture particle filter. Targets were identified in each frame by clustering similar optical-flow features. A similar approach was used in [130], where the current flow vectors were clustered by region growing and pruning, using affine motion consistency as a criterion. Recently, an approach was presented in [25] where the optical flow was used to extract stable trajectories of features. At each time-step, the temporal trajectory of each active feature is considered for thirty frames forward and backward in time. These trajectories are first clustered into a large, predefined, number of clusters. A distance tree is then built among the clusters and a minimum-description-length method is applied to iteratively merge clusters into consistently moving objects. The same approach was adapted by [99], where the feature consistency criterion was formulated through potential functions among different flow trajectories. These potential functions considered motion coherence, spatial coherence as well as temporal inertia.
The features were then clustered hierarchically and heuristics were used to decide when to stop clustering. Bugeau and Pérez [28] introduce the color information in the clustering stage and apply graph cuts to improve segmentation of the object from the background. Assuming that discontinuities in the optical flow occur at the boundaries of a moving object, Lucena et al. [105, 104] were able to track a moving person’s palm using a contour tracker which was based on detecting these discontinuities. A drawback of the approaches which are based on clustering flow vectors is that, due to the clustering procedure and the nature of the optical-flow data, they cannot maintain correct identities of the targets after full occlusion, even if the targets are of different colors. Furthermore, those approaches that rely on the assumption that the target is always in motion are prone to failure when the target stops moving or moves significantly less than another visually similar object in its neighborhood.

1.1.2 Dynamic models

While the visual models are used to capture the visual properties of tracked objects, dynamic models are used to describe their dynamics, i.e., how the objects are expected to move in the image. When the dynamics of the tracked object are known, the search space of the parameters to be estimated during tracking can be constrained considerably. This helps resolve ambiguities in the visual data and can also reduce the processing time required for a single tracking iteration, since smaller portions of the parameter space need to be explored. In this respect, dynamic models have been extensively used in human pose estimation. Sidenbladh et al. [145] apply a strong prior of walking motion to determine the possible movement directions of a tracked person. The prior is learned using a large database of indexed examples.
Agarwal and Triggs [1] use a set of second-order dynamic models to track articulated motion of humans during walking and running. Urtasun et al. [157] use scaled Gaussian process latent variable models with incorporated dynamics to learn a low-dimensional embedding of the pose space for specific movements like golf swings and walking. In order to cover a range of possible dynamics of the tracked object, some authors have proposed an interacting multiple model (IMM) approach. In this approach multiple trackers, each with a different dynamic model, are used in parallel for tracking the target. A special scheme is used to determine how well each model describes the target’s current motion, and the estimates from the different trackers are then combined accordingly. A detailed treatment of different combination schemes is given in [10]. The interacting-multiple-model approaches based on Kalman filters have received considerable attention in the work on aircraft tracking with radars [98, 9], and an application to camera gaze control can be found in [23]. A particle-filter-based implementation of IMM can be found in [111, 20]. A drawback of IMM approaches is that the complexity of tracking increases dramatically, since now the probability distributions have to be estimated over each of the interacting models. In particle filters, the likelihood function of observations has to be evaluated for each hypothesis (particle). In visual tracking, calculating the likelihoods of particles is usually time-consuming, since the visual model has to be calculated for each particle and compared to the reference model. Thus the computational effort of visual tracking with particle filters increases considerably when using IMM approaches. For many applications, such as tracking in sports, gesture-based human-computer interfaces and surveillance, it is difficult to find a compact set of rules that govern the target’s dynamics.
Because of this, and the computational complexity associated with IMM methods, researchers usually model the target’s motion using a single model. The common choices are a random-walk (RW) model or a nearly-constant-velocity (NCV) dynamic model; see [136] for a good treatment of these. The RW model describes the target’s dynamics best when the target performs radical accelerations in different directions, e.g., when undergoing abrupt movements. However, when the target moves in a certain direction (which is often the case in sports and surveillance), the RW model performs poorly and the motion is better described by the NCV model. Thus, to cover a range of different motions, a common solution is to choose either a RW or a NCV model and increase the process noise in the dynamic model. However, to have a sufficiently dense coverage of the probability space, and therefore a satisfactory track, the number of particles in the particle filter also needs to be increased. This, in turn, introduces additional likelihood evaluations, which slows down the tracking.

1.1.3 Managing multiple targets

A non-trivial task when tracking multiple targets is maintaining the correct identities of the targets. In estimation theory, a classical approach to tracking multiple targets involves a detection step followed by target-to-measurement association. In addition to the Nearest Neighbor (NN) filter, techniques such as the Joint Probabilistic Data Association Filter (JPDAF) are common solutions to the association problem [56]. Applications of sports tracking based on the NN and JPDAF approaches can be found in [171, 66, 30] and [77], respectively. Some earlier applications of the JPDAF in the context of computer vision can be found in [132, 138]. The weakness of these approaches is that they involve an explicit detection and exhaustive enumeration of the associations, which leads to an NP-hard problem.
Some attempts to reduce the complexity of the association problem include gating [171, 66, 30] and treating the associations as random variables which can then be assigned via sampling [61]. Another way to tackle the problem of tracking multiple targets is to concatenate the states of all the targets into a single joint state. This makes it possible to apply particle-filtering techniques developed for single-target tracking [123, 116]. By introducing into the joint state an additional variable that indicates the number of targets, the authors of BraMBLe [65] were able to track a varying number of visually similar targets. This approach was adopted by Czyz et al. [39] to track soccer players of the same team. The weakness of the joint-state particle filters is that a poor estimate of a single target may degrade the entire estimation. For this reason, the number of particles needs to be increased, which may render the tracker computationally inefficient for more than three or four targets [81]. Recently, some efficient schemes based on Markov Chain Monte Carlo approaches have been proposed [175, 81] to solve this problem. Vermaak et al. [162] formulated the problem of tracking visually identical targets as the problem of maintaining the multi-modality of the estimated posterior distribution of the target states. The multi-modality of the posterior is maintained by a mixture particle filter. This approach was later applied by Okuma et al. [120] and Cai et al. [30] to track players in a hockey match. With a similar rationale, Chang et al. [32] apply a Parzen estimator [165] to the particle set and use a Mean Shift to detect the modes of the posterior. A simple solution when the number of targets is known is to track each target with a separate tracker.
This approach reduces the size of the state space and allows tracking of a specific target without the need to track all of the other targets as well, thus reducing the computational complexity of the tracker. However, this approach is rather naive, since the target with the highest score will often hijack the trackers of the nearby targets [81]. Solutions based on the histogram back-projection technique [142], the occlusion alarm probability principle [119] and template-based methods [34] have been proposed in the literature to cope with the problem of hijacking. However, when targets appear visually similar, these approaches still fail to maintain the correct identities after the targets come close to each other.

1.2 Contributions

In this thesis we deal with probabilistic models for tracking persons in video data. We explore various probabilistic models concerning the visual and dynamic properties of persons, as well as approaches to tracking multiple targets, with the goal of arriving at solutions that improve tracking performance while not significantly increasing the processing time. We propose several improvements in visual models, dynamic modelling and schemes for multiple-target tracking. The original contributions of the thesis are as follows:

• A color-based visual model for tracking persons is derived, which improves tracking in situations when the color of the tracked object is similar to the color of the background. The proposed color-based visual model uses a novel measure which incorporates the model of the background to determine whether the tracked target is positioned at a given location in the image. A probabilistic model of the novel measure was derived, which allows using the color-based visual model with the particle filter. The visual model does not require a very accurate model of the background, but merely a reasonable approximation of it.
To increase robustness to the color of the background, a mask function is automatically generated by the visual model to mask out the pixels that are likely to belong to the background. A novel adaptation scheme is applied to adapt the visual model to the current appearance of the target. • A combined visual model is proposed, which fuses the color information with the features of local motion, to resolve occlusions between visually similar objects. The local-motion feature is calculated from a sparse estimate of the optical flow, which is evaluated in images only at locations with enough texture. A probabilistic model of the local-motion is derived which accounts for the errors in the optical flow estimation as well as for the rapid changes in the target’s motion. The local-motion model is probabilistically combined with the color-based model into a combined visual model using an assumption that the color is conditionally independent of motion. An approach is also developed to allow adaptation of the local-motion model to the target’s motion. • A two-stage dynamic model is proposed, which combines the liberal and the conservative model to better describe the target’s motion, and a method for setting the parameters of the model is derived. The two-stage dynamic model is composed of two interconnected dynamic models: the liberal and conservative. The liberal model allows larger perturbations in the target’s dynamics and is used within the particle filter to efficiently explore the state space of the target’s parameters. This model is derived by modelling the target’s velocity with a non-zero-mean Gauss-Markov process and can explain well motions ranging from a complete random walk to a nearly-constant velocity. The conservative model imposes stronger restrictions on the target’s velocity and is used to estimate the mean value of the Gauss-Markov process in the liberal model, as well as for regularizing the estimated state from the particle filter. 
We have provided an analysis of the model’s parameters and proposed a rule of thumb for estimating the spectral density of the liberal model.

• A context-based scheme for tracking multiple targets is proposed, which allows tracking with a linear computational complexity. In the context of observing the scene from a bird’s-eye view, the recorded images can be partitioned into regions such that each region contains only a single target. This means that, given a known partitioning, the Bayes filter for tracking multiple targets can be simplified into multiple single-target trackers, each confined to the corresponding partition in the image. A parametric model of the partitions is developed, which requires specifying only the locations of the tracked targets. Since the partitions are not known prior to a given tracking iteration, a scheme is derived which iterates between estimating the targets’ positions and refining the partitions. Using this scheme we simultaneously estimate the locations of the targets in the image as well as the unknown partitioning.

1.3 Thesis outline and summary

The remainder of this thesis is structured into the following six chapters. In Chapter 2 we give a detailed description of the probabilistic framework which we use for tracking: the particle filter. We first pose tracking as a problem of stochastic estimation and show how the well-known conceptual solution emerges with the application of the recursive Bayes filter. After briefly reviewing historical approaches to recursive filtering, we show how particle filters emerge from the recursive Bayes filter when Monte Carlo methods are applied to solve the recursions. In Chapter 3 we present the color-based visual model, the novel measure of presence, its probabilistic model, the dynamic background subtraction scheme and the scheme by which the visual model is adapted to the target’s appearance.
We demonstrate in experiments how the proposed visual model improves the tracking performance in situations when the target moves over a cluttered background. We note that the purely color-based model cannot resolve ambiguities which arise in situations when the target undergoes an occlusion by a visually similar object. To cope with the drawbacks of the purely color-based visual model, we propose in Chapter 4 a novel local-motion-based model which is probabilistically combined with the color-based model into a combined visual model. We describe how the local-motion model is calculated from the optical flow. A probabilistic model of the local motion is derived, and a scheme to adapt the local-motion model to the current appearance of the target is presented. Experiments demonstrate how the proposed visual model can be used to resolve the visual ambiguities and improve tracking performance, with examples of tracking people in surveillance and sports and an example of tracking a person’s palms. Chapter 5 is dedicated to the problem of modelling the target’s dynamics. We propose a two-stage dynamic model and analyze how different values of the parameters influence the structure of this model. We also propose a two-stage probabilistic tracker and confirm the superiority of the two-stage dynamic model over two widely-used dynamic models, both quantitatively and qualitatively. In Chapter 6 we discuss how the context within which we observe a scene can be used to simplify the Bayes recursive filter for multiple targets. Based on this discussion, and using the two-stage dynamic model, we derive a novel multi-target tracker. The proposed tracker can track multiple targets with a very small number of particles in the particle filter, which effectively reduces the processing time required for a single tracking iteration. The proposed multiple-target tracking scheme is evaluated on a demanding data set from sports. In Chapter 7 we summarize this thesis.
We discuss the achieved results and provide an outlook for future avenues of research in the field of tracking persons from video data.

He who controls the present, controls the past. He who controls the past, controls the future.
George Orwell (1903 – 1950)

Chapter 2 Recursive Bayesian filtering

Over the past fifty years, Bayesian approaches called Bayes recursive filters have been shown to provide a solid theoretical ground for probabilistic tracking and have been successfully applied to tracking in visual data. The central point of these approaches is to represent all information about a target at one time-step with a posterior probability density function (pdf) over the target’s parameters, and to estimate this pdf recursively as time progresses. The popularity of the recursive Bayes filter comes from its ability to handle the various uncertainties associated with the target’s dynamics as well as with the sensors by which the target is perceived. One of the early approaches from the class of Bayes recursive filters is the well-known Kalman filter [76]. Although the Kalman filter has been applied with some success to various problems of tracking and estimation, it has a major drawback: it makes certain assumptions which are usually too restrictive for visual tracking. In the late nineties, filters based on Monte Carlo methods were proposed to solve the recursions in the Bayes filter. These approaches, known as particle filters, have gained considerable popularity in various areas of tracking and estimation and are the basis for the tracking algorithms which we propose in this thesis. Drawing heavily from the literature [101, 47, 43, 57, 6, 33], we provide in this chapter a derivation of the particle-filter framework and discuss some implementation issues. The outline of this chapter is as follows.
In Section 2.1 we pose tracking as a problem of stochastic estimation in dynamic systems, which can be treated within the Bayesian framework. In Section 2.2 we introduce the principle at the core of the recursive Bayes filter and in Section 2.3 we describe how different terms in the filter relate to different parts of the stochastic dynamic system. Some historical approaches to recursive filtering are briefly overviewed in Section 2.4. In Section 2.5 we finally show how the particle filters emerge when applying Monte Carlo methods to the Bayes filter.

2.1 Tracking as stochastic estimation

We can think about tracking as a process of obtaining some interesting information about a possibly moving target as time progresses. For example, in visual tracking, where an object is tracked in a sequence of images, the interesting information might be the position and size of the object in the image. To obtain such information, the target has to be modelled, and then this model is used to measure whether the target is located at a given position in the image. If certain values are assigned to the parameters of the model, we say that the target is in a certain state. The space of all parameter values is then called the state space. Formally, we denote the target’s state at time-step k by xk. Starting with an initial state x0, we denote the sequence of all states up to the current time-step k by x0:k = {x0, ..., xk}. The sequence x0:k is governed by a state evolution process, which is defined by the target’s dynamics. Since in general the evolution process is not fully deterministic, and because there is always some uncertainty associated with how well the true target dynamics are modelled by the evolution process, the state of the target at any time-step is usually regarded as a random variable.
Therefore, the state of the target at a given time-step is not described by a single value in the state space, but rather by a distribution of more likely and less likely values – a probability density function (pdf). The information about the state of the target is accessed through the measurement process, which for every state xk produces a measurement (an observation) yk. We denote the sequence of all observed measurements up to time-step k by y1:k = {y1, ..., yk}. Due to the uncertainty introduced by the imperfect target model and the inherent uncertainty of the measurement process, the measurements are also considered random variables. With the definitions above, we see that in tracking we wish to calculate the current state of the target, which is governed by a stochastic process. Furthermore, the information about the state is accessed through another stochastic process. Therefore, tracking can be considered a problem of stochastic estimation of the target’s state (model parameters) as time progresses. A methodology that is well designed to handle the uncertainties involved in the stochastic estimation is provided by Bayesian theory. From a Bayesian point of view, starting from a known prior distribution p(x0) over the initial state x0 of the target, all the interesting information about the sequence of target states x0:k is embodied by the posterior distribution p(x0:k|y1:k), which tells us how likely various state sequences x0:k are in light of the observed sequence of measurements y1:k. If the density p(x0:k|y1:k) is known, then estimates of functions of x0:k can be calculated by mathematical expectation. For example, the estimate which minimizes the mean-squared error is the minimum mean-squared error (MMSE) estimate

⟨x0:k⟩p(x0:k|y1:k) = ∫ x0:k p(x0:k|y1:k) dx0:k,

where ⟨·⟩ denotes the expectation operator.
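When the posterior is represented by a finite set of weighted samples (as in the particle filters derived later in this chapter), the MMSE estimate reduces to a weighted average, and a maximum a posteriori (MAP) estimate can be crudely approximated by the highest-weighted sample. A minimal sketch (the function names are ours, for illustration):

```python
import numpy as np

def mmse_estimate(particles, weights):
    """MMSE estimate: the expectation of the state under the
    weighted-sample approximation of the posterior."""
    return np.average(particles, axis=0, weights=weights)

def map_estimate(particles, weights):
    """Crude MAP estimate: the sample carrying the largest posterior weight."""
    return particles[np.argmax(weights)]
```

With particles {0, 1, 2} and weights {0.2, 0.3, 0.5}, the MMSE estimate is the weighted mean 1.3, while the MAP approximation picks the sample 2.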
Alternatively, the maximum a posteriori estimate MAP(x0:k) can be obtained by choosing the values of x0:k which maximize the posterior p(x0:k|y1:k), i.e.,

MAP(x0:k) = argmax over x0:k of [p(x0:k|y1:k)].

Note that during tracking we are usually interested only in the current state xk of the target and not the entire sequence of states x0:k. In Bayesian terms, this means that we only require the marginal distribution p(xk|y1:k) at time-step k. Furthermore, it is beneficial if the posterior can be calculated recursively using only the posterior from the previous time-step p(xk-1|y1:k-1) and the current observed measurement yk. This procedure is referred to in the literature as the Bayesian recursive filter and is derived next.

2.2 Recursive solution

To derive the recursive solution for calculating the marginal posterior p(xk|y1:k), we start from the complete posterior p(x0:k|y1:k) and apply the Bayes rule

p(x0:k|y1:k) = p(x0:k, y1:k) / p(y1:k). (2.1)

The numerator of (2.1) can be further factored using the chain rule into

p(x0:k, y1:k) = p(yk|y1:k-1, x0:k) p(x0:k|y1:k-1) p(y1:k-1), (2.2)

while the denominator of (2.1) is factored into

p(y1:k) = p(yk|y1:k-1) p(y1:k-1). (2.3)

Plugging (2.2) and (2.3) back into (2.1), factoring p(x0:k|y1:k-1) = p(xk|x0:k-1, y1:k-1) p(x0:k-1|y1:k-1), and cancelling the term p(y1:k-1), we arrive at

p(x0:k|y1:k) = p(yk|y1:k-1, x0:k) p(xk|x0:k-1, y1:k-1) p(x0:k-1|y1:k-1) / p(yk|y1:k-1), (2.4)

which is recursive in that p(x0:k|y1:k) is calculated from p(x0:k-1|y1:k-1). The posterior of the current state xk is obtained by marginalizing the complete posterior (2.4) over the sequence of past states x0:k-1,

p(xk|y1:k) = ∫ p(x0:k|y1:k) dx0:k-1,

which gives

p(xk|y1:k) = [∫ p(yk|y1:k-1, x0:k) p(xk|x0:k-1, y1:k-1) p(x0:k-1|y1:k-1) dx0:k-1] / p(yk|y1:k-1)
= p(yk|y1:k-1, xk) [∫ p(xk|xk-1, y1:k-1) p(xk-1|y1:k-1) dxk-1] / p(yk|y1:k-1). (2.5)

Note that although (2.5) does admit a recursive form, it cannot be calculated recursively in general.
The reason is that the terms p(yk|y1:k-1, xk) and p(xk|xk-1, y1:k-1) are conditioned on the entire sequence of observations y1:k-1 and thus require storing the sequence y1:k-1.¹ Therefore, to make (2.5) a proper recursion, additional restrictions have to be imposed. The first restriction is that, given the current state xk, the current observation yk is conditionally independent of all previous observations y1:k-1:

p(yk|y1:k-1, xk) = p(yk|xk). (2.6)

The second restriction is that, given the state xk-1 from the previous time-step (k-1), the current state xk is conditionally independent of all previous observations y1:k-1,

p(xk|xk-1, y1:k-1) = p(xk|xk-1), (2.7)

which means that the state sequence is a first-order Markov process. In the filtering literature, p(yk|xk) and p(xk|xk-1) are commonly referred to as the likelihood function² and the transition distribution, respectively. Using the restrictions (2.6) and (2.7) we can now rewrite (2.5) as a proper recursion

p(xk|y1:k) = p(yk|xk) [∫ p(xk|xk-1) p(xk-1|y1:k-1) dxk-1] / p(yk|y1:k-1). (2.8)

Equation (2.8) is the well-known recursive Bayes filter and constitutes a central equation for many probabilistic schemes for tracking and estimation in stochastic dynamic systems. In the following we describe how different terms in the recursion (2.8) conceptually relate to the class of stochastic dynamic systems which we consider in this thesis.

¹Note that the normalization p(yk|y1:k-1) in (2.5) also depends on the sequence y1:k-1; however, it is not troublesome, since it is a constant and can be calculated by integrating the numerator, p(yk|y1:k-1) = ∫ p(yk|xk, y1:k-1) p(xk|y1:k-1) dxk.
2.3 Bayes filter for a stochastic dynamic system

The stochastic dynamic system is defined by a set of, possibly nonlinear, system equations

xk = f(xk-1, vk), (2.9)
yk = g(xk, nk), (2.10)

where (2.9) is the process evolution model and (2.10) is the measurement process model. According to (2.9) and (2.10), the state xk-1 evolves through a system transition function f(xk-1, vk) which is driven by the process noise vk. The hidden state xk is then observed through the observation function g(xk, nk), where nk is the observation noise. An equivalent probabilistic model of the dynamic system (2.9, 2.10) is shown in Figure 2.1 as a graphical model. The transition density p(xk|xk-1) is completely defined by the transition function f(xk-1, vk) and the process noise distribution p(vk), while the likelihood function p(yk|xk) is specified by the observation function g(xk, nk) and the measurement noise distribution p(nk).

[Figure 2.1: A graphical model of the stochastic dynamic system with the unobserved states xk and the observed measurements yk. The hidden state xk of the system evolves with time as a partially observed first-order Markov process according to the conditional probability density p(xk|xk-1). The observations yk are conditionally independent given the state and are generated according to the probability density function p(yk|xk).]

In the literature, the recursion of the Bayes filter (2.8) is usually broken into two steps: prediction and update.

²Note that p(yk|xk) is the probability of observing the measurement yk, given that the system is in the state xk. However, p(yk|xk) is also the likelihood of the system being in state xk, given the observation yk; this is the reason why p(yk|xk) is often referred to simply as the likelihood function in filtering theory.
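Before detailing the two steps, the generative system (2.9, 2.10) itself can be illustrated with a short simulation. This is our own sketch, not part of the thesis; the linear choices of f and g and the noise scales are arbitrary, for illustration only:

```python
import numpy as np

def simulate_system(f, g, x0, steps, rng):
    """Draw a state/measurement sequence from the dynamic system
    xk = f(xk-1, vk), yk = g(xk, nk), with scalar Gaussian noise vk, nk."""
    xs, ys = [], []
    x = x0
    for _ in range(steps):
        x = f(x, rng.normal())   # state evolution (2.9)
        y = g(x, rng.normal())   # measurement process (2.10)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Example: a nearly constant state with small process noise,
# observed through larger additive measurement noise.
rng = np.random.default_rng(0)
xs, ys = simulate_system(lambda x, v: x + 0.1 * v,
                         lambda x, n: x + 0.5 * n,
                         x0=0.0, steps=100, rng=rng)
```

The filtering problem discussed in this chapter is exactly the inverse task: recovering the hidden sequence xs from the noisy observations ys.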
In the prediction step, the posterior p(xk-1|y1:k-1) from the previous time-step (k-1) is propagated through the dynamic model p(xk|xk-1) to yield the predictive distribution p(xk|y1:k-1) using the Chapman-Kolmogorov relation:

p(xk|y1:k-1) = ∫ p(xk|xk-1) p(xk-1|y1:k-1) dxk-1. (2.11)

Note that (2.11) is just the integral from the right-hand side of the Bayes recursion (2.8). In the update step, the predictive distribution is updated using the likelihood p(yk|xk) associated with the observed measurement yk, and normalized to yield the new posterior p(xk|y1:k),

p(xk|y1:k) = p(yk|xk) p(xk|y1:k-1) / p(yk|y1:k-1), (2.12)

where the normalization function is calculated by integrating the numerator over all values of xk, i.e., p(yk|y1:k-1) = ∫ p(yk|xk) p(xk|y1:k-1) dxk. At first glance, the calculation of the recursion (2.11, 2.12) may appear a straightforward matter. Unfortunately, it is merely conceptual, and analytic solutions exist only for a handful of systems. The reason is that the integrations involved in calculating p(xk|y1:k-1) and p(yk|y1:k-1) generally do not have closed-form solutions. Furthermore, even if the posterior p(xk|y1:k) can be found, estimates of the target’s state, such as the MMSE, are likely to be intractable [57]. A number of approaches have been proposed in the literature to make the recursions (2.11, 2.12) tractable. We give a brief overview of the better known ones next.

2.4 Historical approaches to recursive filtering

When the measurement and the system models are linear and the noise in both models is Gaussian, closed-form solutions of the integrals in the recursions of the Bayes filter exist. In particular, if we start from a Gaussian prior p(x0), and assume that the prediction step (2.11) and the update step (2.12) of the recursive Bayes filter are linear operations on Gaussians, then the resulting posterior is a Gaussian as well.
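In this linear-Gaussian case, the prediction (2.11) and update (2.12) reduce to closed-form recursions on the mean m and covariance P of the Gaussian posterior. A minimal sketch, assuming the standard model xk = F xk-1 + vk, yk = H xk + nk with vk ~ N(0, Q) and nk ~ N(0, R):

```python
import numpy as np

def kalman_predict(m, P, F, Q):
    """Prediction step: propagate the Gaussian posterior through the dynamics."""
    return F @ m, F @ P @ F.T + Q

def kalman_update(m, P, y, H, R):
    """Update step: correct the prediction with the measurement y."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    m_new = m + K @ (y - H @ m)
    P_new = (np.eye(len(m)) - K @ H) @ P
    return m_new, P_new
```

One tracking iteration is a predict followed by an update; note that the update always shrinks the covariance, reflecting the information gained from the measurement.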
Thus the recursions (2.11, 2.12) can be interpreted as a recursive estimation of the mean and covariance of the posterior p(xk|y1:k). The filter that emerges from this is the well-known Kalman filter [76], which was originally derived by R. E. Kalman in the early sixties. Since then, it has been applied extensively to various filtering problems and many variants have been presented; see, e.g., [75, 133, 73, 67, 126]. In practice, the use of the Kalman filter is limited by the nonlinearity and the non-Gaussian nature of the physical world. As an attempt to deal with these nonlinearities, a modification called the extended Kalman filter (EKF) was proposed (see, e.g., [67, 5]). This filter locally linearizes the system transition and measurement models and then applies the standard Kalman filter to analytically calculate the posterior p(xk|y1:k). Usually, a first-order Taylor series expansion is used for the linearization. To obtain better approximations, higher-order variants of the EKF have been proposed [155]. However, the additional complexity that these higher-order extensions entail has prevented their widespread use [6]. There are two sources of inaccuracy in the EKF. The first is the linearization of the system and measurement models, which is carried out only at a single point and does not take into account the uncertainty associated with that point. The second is the assumption that the prior and posterior are Gaussians. In fact, the nonlinearities in the models will result in posteriors being non-Gaussian, and sometimes even multi-modal, after the propagation step. This undermines the Gaussian assumption in the EKF and consequently results in degraded performance or even in complete failure of the tracker. As an attempt to better deal with the nonlinearities of the measurement and system transition models, the so-called unscented Kalman filter (UKF) was proposed by Julier and Uhlmann [69, 71, 70, 72] in the late nineties.
The basic idea behind the UKF is to avoid the linearization of the measurement process and the system dynamics altogether. The posterior is, as in the EKF, still approximated by a Gaussian, but this Gaussian is encoded through a set of points called sigma points. These points are carefully chosen along the principal axes of the Gaussian such that they capture its mean and covariance. In the propagation step, the points are propagated through the true nonlinear system and capture the mean and the covariance of the new posterior accurately up to the third order (in terms of the Taylor expansion) [69]. This is a significant improvement in comparison with the standard EKF, whose approximation is accurate only up to the first order. Furthermore, the computational complexity of the UKF is lower than that of the EKF, since it does not require linearization. Separately from the UKF, another filter, called the central-difference Kalman filter (CDKF), has been proposed [79, 117]. Van der Merwe [161, 159] observed that the UKF and CDKF algorithms are related through an implicit use of a technique called weighted statistical linear regression [94], which is responsible for determining the locations as well as the weights of the sigma points. These filters have therefore been given a common name: sigma-point Kalman filters (SPKF) [159]. Although the SPKF is more accurate than the EKF, it still assumes that the posterior can be approximated by a single Gaussian. As such, it is still prone to failure in certain nonlinear, non-Gaussian problems which involve multi-modal and/or heavy-tailed posterior distributions. As an alternative to approximating the posterior using only a single Gaussian, the Gaussian sum filter (GSF) was proposed [150, 4], which applies a mixture of Gaussians to better model the multimodality of the posterior distribution.
The motivation behind this was the fact that any non-Gaussian density can be approximated sufficiently well by a sufficiently large number of Gaussians [165]. Thus the GSF was conceptually presented as a bank of extended Kalman filters, where at each time-step each component of the Gaussian mixture was propagated using an EKF. One drawback of the filter was that the number of components in the mixture had to be specified in advance. Another drawback was that in cases where the observation noise and/or the system noise were also approximated by a Gaussian mixture model, the number of components in the posterior increased exponentially after each tracking iteration, and methods for reducing the number of components had to be used. Furthermore, since the GSF used the EKF to propagate the components of the mixture, it inherently suffered from the problem of the first-order linearization in the EKF.

In contrast to the parametric models, which use Gaussians or Gaussian mixtures to calculate the recursions in the Bayes filter analytically, non-parametric methods, called grid-based methods, have been proposed to calculate the recursions of the Bayes filter numerically. These approaches are based on discretizing the continuous state-space into a predefined set of cells and then evaluating the posterior only at these cells. In some variations, the posterior is approximated using a discrete distribution over the grid cells [149]. Other approaches use splines [27], step functions [86], or apply quadrature methods [166]. The advantage of these approaches is that they simplify the recursions of the Bayes filter, as each point on the grid is updated independently of the others. Therefore, given enough cells, they hold the potential of approximating fairly complex distributions.
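As a minimal illustration of the grid-based idea, the Bayes recursion (2.11)-(2.12) can be evaluated numerically on a one-dimensional grid. The random-walk dynamic model, the Gaussian likelihood and all numeric settings below are illustrative stand-ins, not models from the thesis.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-5.0, 5.0, 401)          # discretized 1-D state-space (grid cells)
dx = x[1] - x[0]
posterior = gauss(x, 0.0, 1.0)           # p(x_{k-1} | y_{1:k-1}) on the grid
posterior /= posterior.sum() * dx

# Prediction step (2.11): numerically integrate p(x_k|x_{k-1}) p(x_{k-1}|y_{1:k-1})
trans = gauss(x[:, None], x[None, :], 0.5)   # transition density, random walk
predictive = trans @ posterior * dx

# Update step (2.12): multiply by the likelihood p(y_k|x_k) and normalize
y_k = 1.2                                 # an illustrative measurement
likelihood = gauss(y_k, x, 0.8)
posterior_k = likelihood * predictive
posterior_k /= posterior_k.sum() * dx     # division by p(y_k | y_{1:k-1})

mmse = (x * posterior_k).sum() * dx       # posterior mean on the grid
```

Note that each grid cell of the predictive distribution is computed independently of the others, which is exactly the simplification the grid-based methods exploit.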
A significant drawback of the grid-based methods is, however, that they require specifying the number of cells in advance, and the grid has to be sufficiently dense to allow a good approximation of the posterior. With increasing dimension of the state-space, the computational burden thus increases dramatically. Another drawback is that, because the state of the target moves through the continuous state-space, the grid also has to move along with the target, and has to scale accordingly to cover the significant probability mass corresponding to the target's current state. This is a nontrivial task which further increases the computational burden of the method.

In the late eighties and early nineties, with the advent of increased computational power in computers, significant steps were made toward calculating the recursions of the recursive Bayes filter by means of simulation [83, 82]. As a result, a number of methods were independently developed in fields such as statistics, econometrics, engineering and computer science that were based on evaluating the integrals of the Bayesian recursion by Monte Carlo sampling. These methods are commonly known as sequential Monte Carlo methods or particle filters and are described next.

2.5 Monte-Carlo-based recursive filtering

The aim of recursive Bayesian filtering is to calculate the posterior distribution $p(x_k|y_{1:k})$ of the target's state $x_k$ given the observations $y_{1:k}$. Once the posterior is known, various estimates of the target's state can be calculated. For example, the minimum-mean-squared-error (MMSE) estimate of the target's state can be obtained by taking the expectation $\hat{x}_k = \int x_k\,p(x_k|y_{1:k})\,dx_k$. However, the MMSE estimate involves solving a high-dimensional integral which is often intractable. A common approach to evaluating intractable high-dimensional integrals is to apply Monte Carlo integration, where the integrals are replaced by summations over discrete sets of points.
In this section we will first briefly overview the theoretical background of Monte Carlo integration and show how it is used to approximate the recursive Bayes filter. We will point out some drawbacks and implementation issues of this approximation and discuss solutions and simplifications proposed in the literature. In light of these, we will finally present a well-known approximate recursive Bayes filter called the particle filter.

2.5.1 Perfect Monte Carlo sampling

First we will discuss how Monte Carlo methods can be conceptually used to approximate integrals. For the sake of clarity we will drop the subscripts $(\cdot)_k$, which indicate the time-steps, for now and reintroduce them later, when we consider the application of these methods to Bayesian filtering. Note that in tracking we are generally interested in calculating expectations over some posterior $p(x|y)$, which are integrals of the type

$I(f) = \int f(x)\,p(x|y)\,dx$,

where $f(\cdot)$ is a $p(x|y)$-integrable function of $x$. In perfect Monte Carlo sampling, the integral $I(f)$ can be approximated as follows. First, $N$ independent samples $\{x^{(i)}\}_{i=1\ldots N}$ are drawn from the posterior $p(x|y)$. Using these samples, the posterior can be approximated by the following empirical estimate

$p_N(x|y) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x)$,  (2.13)

which is essentially a set of Dirac-delta functions $\delta_{x^{(i)}}(x)$ located at the sampled points $x^{(i)}$. Replacing $p(x|y)$ with its empirical estimate $p_N(x|y)$, the integral $I(f)$ can now be numerically approximated by $I_N(f)$ using Monte Carlo integration:

$I_N(f) = \int f(x)\,p_N(x|y)\,dx = \frac{1}{N}\sum_{i=1}^{N}\int f(x)\,\delta_{x^{(i)}}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$.  (2.14)

The validity of the approximation $I(f) \approx I_N(f)$ is guaranteed by the strong law of large numbers ([43], page 7), which states that the average of many independent random numbers with a common mean and finite variance converges to the common mean, $\lim_{N\to\infty} I_N(f) = I(f)$, with probability one.
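The estimate (2.14) can be demonstrated in a few lines. As a sketch, take $p(x|y)$ to be a standard normal and $f(x) = x^2$, so that the true value of the integral is the variance, $I(f) = 1$; both choices are illustrative.

```python
import numpy as np

# Perfect Monte Carlo estimate (2.14): draw i.i.d. samples from the target
# density and average f over them. Here p(x|y) = N(0,1) and f(x) = x^2.
rng = np.random.default_rng(0)
N = 200_000
samples = rng.standard_normal(N)   # i.i.d. draws x^(i) ~ p(x|y)
I_N = np.mean(samples ** 2)        # (1/N) sum_i f(x^(i)), true value is 1
```

With $N = 200{,}000$ the estimate is typically within a fraction of a percent of the true value, consistent with the $1/\sqrt{N}$ convergence rate discussed next.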
Moreover, if the variance $\sigma_f^2$ of $f(x)$ with respect to $p(x|y)$ is finite, the central limit theorem tells us that, as $N$ increases, the scaled difference between the integral $I(f)$ and its approximation $I_N(f)$ converges to a normal distribution,

$\sqrt{N}\,[I_N(f) - I(f)] \underset{N\to\infty}{\Longrightarrow} \mathcal{N}(0, \sigma_f^2)$,  (2.15)

where $\Rightarrow$ denotes convergence in distribution ([43], page 7). From (2.15) we see that the accuracy of the estimator $I_N(f)$ increases with the number of samples and does not depend directly on the dimension of the integrand³. In contrast, in a deterministic numerical integration the accuracy of the integral decreases as the dimension of the integrand increases ([43], page 7).

The above discussion tells us that if we are able to provide independent samples from $p(x|y)$, then integrals of the type $I(f)$ can be easily approximated. In practice, however, $p(x|y)$ will be multivariate, nonstandard and known only up to a proportionality constant. Thus sampling from $p(x|y)$ directly is usually impossible and alternative solutions are required. One common solution is to use importance sampling.

³Note, however, that $\sigma_f^2$ in (2.15) may still grow appreciably with the dimension of the integrand [57].

2.5.2 Importance sampling

Let $p(x|y)$ be some distribution which is difficult to sample from, but can be evaluated up to a proportionality constant as

$\tilde{p}(x|y) = C_p\,p(x|y)$.  (2.16)

Let $q(x|y)$ be another distribution which is easy to sample from, can also be evaluated point-wise, and has the same support as $p(x|y)$⁴. In the literature, $q(x|y)$ is usually called the importance function or the proposal distribution. With the above definitions, the integral $I(f)$ can be rewritten as

$I(f) = \int f(x)\,p(x|y)\,dx = \frac{1}{C_p}\int f(x)\,\frac{\tilde{p}(x|y)}{q(x|y)}\,q(x|y)\,dx = \frac{1}{C_p}\int f(x)\,q(x|y)\,w(x)\,dx$,

where $\tilde{p}(x|y)$ was absorbed into the importance weight

$w(x) = \frac{\tilde{p}(x|y)}{q(x|y)}$.  (2.17)

The proportionality constant $C_p$ is obtained by integrating both sides of (2.16), $C_p = \int \tilde{p}(x|y)\,dx = \int q(x|y)\,w(x)\,dx$, and $I(f)$ can be rewritten as

$I(f) = \frac{1}{C_p}\int f(x)\,q(x|y)\,w(x)\,dx$.  (2.18)
Since $q(x|y)$ can be easily sampled, it can be approximated by an empirical Monte Carlo estimate analogous to (2.13),

$q_N(x|y) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x)$,  (2.19)

where $\{x^{(i)}\}_{i=1:N}$ is a set of independent and identically distributed (i.i.d.) samples from $q(x|y)$. The integral $I(f)$ (2.18) can now be approximated by $I_N(f)$, where

$I_N(f) = \frac{1}{C_p^N}\int f(x)\,q_N(x|y)\,w(x)\,dx = \frac{\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})\,w(x^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} w(x^{(i)})}$.

By cancelling the terms $\frac{1}{N}$ in the above equation and taking the normalization into the sum, we can further rewrite the integral as

$I_N(f) = \sum_{i=1}^{N} f(x^{(i)})\,\tilde{w}^{(i)}$,   $\tilde{w}^{(i)} = \frac{w(x^{(i)})}{\sum_{j=1}^{N} w(x^{(j)})}$.  (2.20)

It is beneficial to note that, in terms of a perfect Monte Carlo sampling, the integral $I_N(f)$ can be viewed as an expectation of $f(x)$ under the following empirical distribution

$\tilde{p}_N(x|y) = \sum_{i=1}^{N}\tilde{w}^{(i)}\,\delta_{x^{(i)}}(x)$.

⁴We say that $q(x|y)$ has the same support as $p(x|y)$ when it is nonzero at least for all those $x$ for which $p(x|y)$ is nonzero as well.

Returning to Bayesian filtering, we reintroduce the time-step subscripts. Applying importance sampling sequentially, at each time-step, to the posterior over the state history yields a recursion for the importance weights whose proposal distribution is, in general, of the form $q(x_k|x_{0:k-1}, y_{1:k})$⁵. Note that this is not a proper recursion, since the proposal distribution is conditioned on $y_{1:k}$, and thus requires storing the entire sequence of past observations. To avoid that, the proposal is usually modified [6] into

$q(x_k|x_{0:k-1}, y_{1:k}) = q(x_k|x_{k-1}, y_k)$.  (2.25)

With this definition of the importance function, the equation for calculating the nonnormalized importance weights⁶ $\tilde{w}_k^{(i)}$ takes the following form

$\tilde{w}_k^{(i)} = w_{k-1}^{(i)}\,\frac{p(y_k|x_k^{(i)})\,p(x_k^{(i)}|x_{k-1}^{(i)})}{q(x_k^{(i)}|x_{k-1}^{(i)}, y_k)}$,  (2.26)

and the posterior $p(x_k|y_{1:k})$ in the current time-step is approximated by the following random measure

$p_N(x_k|y_{1:k}) = \sum_{i=1}^{N} w_k^{(i)}\,\delta_{x_k^{(i)}}(x_k)$,  (2.27)

where $w_k^{(i)}$ are now normalized importance weights, i.e., $w_k^{(i)} = \tilde{w}_k^{(i)}\,(\sum_{j=1}^{N}\tilde{w}_k^{(j)})^{-1}$.

⁵We have dropped the term $p(y_k|y_{1:k-1})$ from the denominator of the posterior, since it is independent of the sequence of states $x_{0:k}^{(i)}$ and is constant at $k$.
⁶By nonnormalized importance weights we refer to a set of weights which do not sum to one.
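The self-normalized estimate (2.20) can be sketched on a static example. The target below is an unnormalized standard normal (so $C_p$ is unknown to the estimator), the proposal is a wider normal $\mathcal{N}(0, 2^2)$, and $f(x) = x^2$, so the true value is again 1; all of these are illustrative choices.

```python
import numpy as np

# Self-normalized importance sampling (2.20): samples come from the proposal
# q, and each is weighted by w(x) = p_tilde(x) / q(x) as in (2.17).
rng = np.random.default_rng(1)
N = 200_000

p_tilde = lambda x: np.exp(-0.5 * x**2)                        # p up to C_p
q_pdf = lambda x: np.exp(-0.5 * (x / 2)**2) / (2 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, N)        # i.i.d. samples from the proposal q
w = p_tilde(x) / q_pdf(x)          # unnormalized importance weights (2.17)
w_tilde = w / w.sum()              # normalized weights from (2.20)
I_N = np.sum(x**2 * w_tilde)       # weighted estimate of E_p[x^2], true value 1
```

The unknown constant $C_p$ cancels in the normalization, which is precisely why the weights only need to be known up to scale.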
An approximate recursive Bayesian filter that follows from (2.27) and (2.26) is called the Sequential Importance Sampling (SIS) algorithm and is summarized in Algorithm 2.1.

Input:
• Posterior from the previous time-step: $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$
Output:
• Posterior from the current time-step: $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$

1. For $i = 1 : N$,
   • Sample a new particle: $x_k^{(i)} \sim q(x_k|x_{k-1}^{(i)}, y_k)$.
   • Assign a weight: $\tilde{w}_k^{(i)} = w_{k-1}^{(i)}\,\frac{p(y_k|x_k^{(i)})\,p(x_k^{(i)}|x_{k-1}^{(i)})}{q(x_k^{(i)}|x_{k-1}^{(i)}, y_k)}$.
2. For $i = 1 : N$,
   • Normalize the weights: $w_k^{(i)} = \tilde{w}_k^{(i)} / \sum_{j=1}^{N}\tilde{w}_k^{(j)}$.
3. The new random measure is an empirical approximation to the true posterior: $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N} \approx p(x_k|y_{1:k})$.

Algorithm 2.1: Sequential Importance Sampling (SIS) algorithm.

2.5.4 Degeneracy of the SIS algorithm

The SIS algorithm has a major practical drawback. Since we recursively multiply the weights from the previous time-step with the weights in the current time-step, the variance of the weights tends to increase with time [43], and thus the approximation of the posterior deteriorates. Furthermore, any estimate based on this approximation (e.g., the MMSE estimate) deteriorates as well. The authors of [6] have reported that in practice, after a few iterations, most of the normalized weights will be close to zero. From a computational standpoint this means that most calculations are expended on weights whose contribution to the estimates is negligible. This effect is commonly known in the literature as the problem of degeneracy and is usually tackled by the following two approaches [43, 6, 57]:

1. Choice of a suitable importance function.
2. Application of resampling.

Choice of the importance function

In view of alleviating the degeneracy of weights in SIS, the optimal choice of the importance function is the one which minimizes the variance of the true weights, conditioned on the previous state and the current measurement [6]. It has been shown in [43] that in terms of variance minimization, the optimal importance function is $q(x_k|x_{k-1}, y_k) = p(x_k|x_{k-1}, y_k)$.
There are two cases in which such an importance function can be used [6]. One is when the state-space of $x_k$ is discrete and finite. The other is when the measurement model is linear and the noise in the system and measurement models is Gaussian. Unfortunately, as we will see in later sections, these restrictions do not apply to our applications of visual tracking. Various authors have proposed methods that use other, suboptimal, importance functions. Among them are the auxiliary particle filter [129], the Gaussian mixture sigma-point particle filter [160], the annealed particle filter [41], partitioned sampling [107] and layered sampling [124], to name but a few.

An alternative choice of the importance function is the prior transition model

$q(x_k|x_{k-1}, y_k) = p(x_k|x_{k-1})$.  (2.28)

What is appealing about this choice is that it only requires us to sample from the system dynamic model, which is often easy. Furthermore, the weight update equation (2.26) simplifies to

$\tilde{w}_k^{(i)} = w_{k-1}^{(i)}\,p(y_k|x_k^{(i)})$.  (2.29)

However, choosing the prior for the importance function may also have a deteriorative effect on the performance of SIS. This happens in situations when the likelihood $p(y_k|x_k)$ is quite peaked in comparison to the prior $p(x_k|x_{k-1})$. An example of such a situation is illustrated in Figure 2.2. Only a few particles generated from the prior are located at the peak of the likelihood function. As a result, the majority of the particles have weights very close to zero. Note that this is in fact equivalent to the problem of degeneracy which we are trying to avoid. Nevertheless, the simplicity of implementation and weight calculation that such a choice of importance function offers makes it a very popular choice among practitioners.

Figure 2.2: Samples (depicted by circles) are drawn from the prior $p(x_k|x_{k-1})$. A weight is assigned to each sample according to the likelihood function $p(y_k|x_k)$.
The weight of each sample is indicated by the radius of the circle; a large radius corresponds to a large weight, while a small radius corresponds to a small weight. Note that since only a few samples are generated at the peak of the likelihood function, the majority of samples have weights close to zero.

Application of resampling

The second approach to alleviating the effects of degeneracy is to use resampling whenever a significant degeneracy is observed. Recall that the posterior $p(x_k|y_{1:k})$ is approximated by the following random measure

$p_N(x_k|y_{1:k}) = \sum_{i=1}^{N} w_k^{(i)}\,\delta_{x_k^{(i)}}(x_k)$.  (2.30)

Resampling avoids the degeneracy of weights by selecting the particles with high weights multiple times and discarding those with smaller weights, thus generating a new random measure

$p_N(x_k|y_{1:k}) = \sum_{i=1}^{N} \frac{1}{N}\,\delta_{\tilde{x}_k^{(i)}}(x_k)$,

under which all new particles $\tilde{x}_k^{(i)}$ have equal weights $\frac{1}{N}$ and are still approximately distributed according to the original posterior $p(x_k|y_{1:k})$. In this way, the particles are herded into regions of high probability density of the posterior $p(x_k|y_{1:k})$. Resampling is required especially in those situations when the particle set becomes degenerated. In the literature [101, 13, 6], an effective sample size $N_{eff}$, sometimes also called the survival diagnostic [106], is proposed as a measure of degeneracy:

$N_{eff} = \frac{1}{\sum_{i=1}^{N}(w_k^{(i)})^2}$.  (2.31)

When $N_{eff}$ falls below a predefined threshold, the particle set is resampled. Many resampling schemes have been proposed in the literature: stratified sampling and residual sampling [101], systematic resampling [83], deterministic resampling [106], resampling based on ordered statistics [134, 31], regularized sampling [47], etc. We use deterministic resampling [106] in our implementation, since it is simple to implement and the complexity of the algorithm is O(N). The deterministic resampling is summarized in Algorithm 2.2.
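The degeneracy diagnostic (2.31) and a deterministic resampler in the spirit of Algorithm 2.2 can be sketched as follows; the vectorized search below is an assumption of this sketch (it replaces the explicit while-loop with an equivalent binary search over the cumulative weights).

```python
import numpy as np

def effective_sample_size(w):
    # N_eff of (2.31): N for uniform weights, 1 when one particle carries all mass
    return 1.0 / np.sum(w ** 2)

def deterministic_resample(particles, w):
    # Place the points i/N, i = 1..N, into the cumulative distribution of the
    # weights; high-weight particles are copied several times, low-weight ones
    # are dropped, and all new weights are set to 1/N.
    N = len(w)
    c = np.cumsum(w)
    positions = np.arange(1, N + 1) / N
    idx = np.minimum(np.searchsorted(c, positions), N - 1)  # clamp fp round-off
    return particles[idx], np.full(N, 1.0 / N)

w = np.array([0.1, 0.1, 0.7, 0.1])
n_eff = effective_sample_size(w)                    # close to 2: degenerate set
new_particles, new_w = deterministic_resample(np.array([0.0, 1.0, 2.0, 3.0]), w)
```

On this toy weight vector the dominant particle (weight 0.7) is copied three times, while the two lightest particles disappear.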
It is important to note that while resampling does alleviate the problem of degeneracy, it introduces other problems. One is that particles with higher weights are chosen multiple times in the resampling step. This can reduce the diversity of the particles, since the new particle set contains multiple copies of the same particles. This is known in the literature as sample impoverishment and can be critical especially when the system noise is small [6]. In those situations, after a few iterations, the entire particle set can collapse into a single point [6]. Nevertheless, despite the drawback of possible sample impoverishment, resampling deals well with the problem of degeneracy and has as such become an integral part of all modern particle-set-based recursive Bayesian filters.

Input: Posterior $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$
Output: Resampled posterior $p(x_k|y_{1:k}) \approx \{\tilde{x}_k^{(i)}, \tilde{w}_k^{(i)}\}_{i=1}^{N}$

1. Generate a cumulative distribution $\{c^{(i)}\}_{i=1}^{N}$ from the particle weights: $c^{(i)} = \sum_{j=1}^{i} w_k^{(j)}$.
2. Initialize: $l = 1$.
3. For $i = 1 : N$,
   • while $(\frac{i}{N} > c^{(l)})$: $l = l + 1$.
   • choose $\tilde{x}_k^{(i)} = x_k^{(l)}$ and set $\tilde{w}_k^{(i)} = \frac{1}{N}$.

Algorithm 2.2: Deterministic resampling.

2.5.5 Particle filters

If resampling (e.g., Algorithm 2.2) is used with the SIS algorithm (Algorithm 2.1) and the effective sample size $N_{eff}$ (2.31) is used to decide when to resample, we obtain the so-called generic particle filter, which is summarized in Algorithm 2.3. A special variant of the generic particle filter has become popular and widely used for visual target tracking in the last decade. This variant makes the following two simplifications:

• The prior is used as the importance function: $q(x_k|x_{k-1}, y_k) = p(x_k|x_{k-1})$.
• Resampling is executed at each iteration (this is equivalent to setting $N_{thres} = \infty$ in Algorithm 2.3).
The variant that uses the above two simplifications has emerged under various names, such as the bootstrap particle filter (BPF) [53] (published in 1993) and the Condensation algorithm [63] (published in 1996), to name just the two more visible ones.

Input:
• Posterior from the previous time-step: $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$
Output:
• Posterior from the current time-step: $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$

1. Evaluate $N_{eff}$ using (2.31).
2. If $N_{eff} < N_{thres}$, resample $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$ using Algorithm 2.2: $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N} \to \{\tilde{x}_{k-1}^{(i)}, \frac{1}{N}\}_{i=1}^{N}$.
3. For $i = 1 : N$,
   • Sample a new particle: $x_k^{(i)} \sim q(x_k|\tilde{x}_{k-1}^{(i)}, y_k)$.
   • Assign a new weight: $\tilde{w}_k^{(i)} = \frac{p(y_k|x_k^{(i)})\,p(x_k^{(i)}|\tilde{x}_{k-1}^{(i)})}{q(x_k^{(i)}|\tilde{x}_{k-1}^{(i)}, y_k)}$.
4. For $i = 1 : N$,
   • Normalize the weights: $w_k^{(i)} = \tilde{w}_k^{(i)} / \sum_{j=1}^{N}\tilde{w}_k^{(j)}$.
5. The new random measure is an empirical approximation to the posterior: $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N} \approx p(x_k|y_{1:k})$.

Algorithm 2.3: A generic particle filter.

The bootstrap particle filter is summarized in Algorithm 2.4. Note that since the particle filter approximates the posterior over the target's state by a weighted sample set, $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$, the minimum-mean-squared-error (MMSE) estimate $\hat{x}_k$ is approximated as

$\hat{x}_k = \sum_{i=1}^{N} x_k^{(i)}\,w_k^{(i)}$.  (2.32)

Input:
• Posterior from the previous time-step: $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$
Output:
• Posterior from the current time-step: $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$

1. Resample $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$ using Algorithm 2.2: $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N} \to \{\tilde{x}_{k-1}^{(i)}, \frac{1}{N}\}_{i=1}^{N}$.
2. For $i = 1 : N$,
   • Predict by sampling a new particle: $x_k^{(i)} \sim p(x_k|\tilde{x}_{k-1}^{(i)})$.
   • Update by assigning a new weight: $\tilde{w}_k^{(i)} = p(y_k|x_k^{(i)})$.
3. For $i = 1 : N$,
   • Normalize the weights: $w_k^{(i)} = \tilde{w}_k^{(i)} / \sum_{j=1}^{N}\tilde{w}_k^{(j)}$.
4. The new random measure is an empirical approximation to the true posterior: $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N} \approx p(x_k|y_{1:k})$.

Algorithm 2.4: Bootstrap particle filter.

All trackers which will be developed in the following chapters are based on the bootstrap particle filter.
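The resample-predict-update cycle of the bootstrap filter can be condensed into a short sketch. The 1-D random-walk dynamics, the Gaussian likelihood and all numeric settings below are illustrative toy choices, and multinomial resampling stands in for Algorithm 2.2.

```python
import numpy as np

# A minimal bootstrap particle filter (Algorithm 2.4) for a 1-D toy model:
# p(x_k|x_{k-1}) = N(x_{k-1}, q^2) and p(y_k|x_k) = N(x_k, r^2).
rng = np.random.default_rng(2)
N, q, r = 2000, 0.3, 0.5

def bpf_step(particles, weights, y):
    # 1. Resample according to the weights (multinomial stand-in)
    particles = particles[rng.choice(N, size=N, p=weights)]
    # 2. Predict: simulate each particle through the dynamic model
    particles = particles + rng.normal(0.0, q, N)
    # 3. Update: weight by the likelihood p(y_k|x_k) and normalize
    w = np.exp(-0.5 * ((y - particles) / r) ** 2)
    return particles, w / w.sum()

particles = rng.normal(0.0, 1.0, N)       # samples from the prior p(x_0)
weights = np.full(N, 1.0 / N)
for y_k in [0.2, 0.5, 0.9, 1.1]:          # a short synthetic measurement sequence
    particles, weights = bpf_step(particles, weights, y_k)

x_mmse = np.sum(particles * weights)      # MMSE estimate (2.32)
```

After four measurements drifting toward 1.1, the weighted mean settles between the predicted and the measured position, as the posterior mean should.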
We therefore conclude this chapter with a detailed illustration of a BPF iteration on an example of one-dimensional tracking.

Example 1. Consider the problem of estimating the horizontal position of the frog in Figure 2.3 from time-step $(k-1)$ to time-step $k$. At time-step $k$ we know an empirical estimate of the posterior over the horizontal position from the previous time-step $(k-1)$ in the form of a particle set $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$ (Figure 2.3a). We wish to calculate an approximation to the posterior in the current time-step $k$.

Figure 2.3: Example of estimating the horizontal position of the frog in two consecutive images (a) and (b). The posterior $p(x_{k-1}|y_{1:k-1})$ at $(k-1)$ is approximated by particles, which are depicted by circles below the frog. The aim of tracking is to estimate the posterior in the current time-step $k$, as the frog changes its position.

The main steps of the BPF iteration are illustrated in Figure 2.4 and are listed as follows:

• We start with the posterior $p(x_{k-1}|y_{1:k-1})$ from the previous time-step $(k-1)$ (Figure 2.4, stage 0).
• First, all particles are resampled by the deterministic resampling (Algorithm 2.2). The resulting empirical distribution $\{\tilde{x}_{k-1}^{(i)}, \frac{1}{N}\}_{i=1}^{N}$ is still an approximation to $p(x_{k-1}|y_{1:k-1})$. The particles that originally had higher weights are multiplied many times, while those with small weights are discarded (Figure 2.4, stage 1).
• Then each particle is simulated according to the system's dynamic model $p(x_k|x_{k-1})$. The resulting random measure $\{x_k^{(i)}, \frac{1}{N}\}_{i=1}^{N}$ is an approximation to $p(x_k|y_{1:k-1})$, which is the prediction of the posterior from the previous time-step (Figure 2.4, stage 2).
• Finally, each particle is assigned a weight according to the likelihood function $p(y_k|x_k)$. All weights are normalized such that they sum to one, and the resulting particle set $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$ is an empirical approximation to the posterior $p(x_k|y_{1:k})$ (Figure 2.4, stage 3).
The average position (the MMSE estimate of the state) $\hat{x}_k$ can be calculated from the approximation to $p(x_k|y_{1:k})$ and is depicted in Figure 2.4 (left column, last row) by an orange circle.

Figure 2.4: An illustration of a single iteration of the bootstrap particle filter (Algorithm 2.4) with four stages from Example 1. The posterior over the frog's horizontal position (stage 0) is presented by twelve particles, where each particle is depicted by a purple circle and the weight of the particle is indicated by the circle's radius: the larger the radius, the larger the weight. The three subsequent stages are resampling (stage 1), prediction (stage 2) and update (stage 3). The left column shows how the particles evolve with respect to the frog's position, while the right column shows the different stages in more detail. The average state (position) of the frog in the current time-step is depicted by an orange circle overlaid on the particles in the left column of stage 3.

Blue flower, red thorns. Blue flower, red thorns. Blue flower, red thorns. Oh, this would be so much easier if I wasn't color-blind!
Donkey in Shrek

Chapter 3

Color-based tracking

One of the essential parts of visual tracking is the visual model of the target, which allows us to evaluate whether a target is present at a given location in the image. It might seem that we require a very detailed visual model to discriminate the tracked object from its surroundings. However, too detailed visual models are not appropriate when tracking nonrigid objects such as people. While they move, their appearance changes, and the visual model has to adapt to these changes, which is not easily achievable in a robust way. A different approach is to use a less detailed visual model, e.g., a color histogram, which can better account for the small changes in the target's appearance.
However, during tracking, the target's appearance will still change slowly, and a mechanism to adapt the reference visual model to these changes is still needed. Furthermore, the color-based tracking may degrade when the target is located on a cluttered background¹, since the measurements provided by the color model become too ambiguous. In this chapter we propose a color-histogram-based model of the target, which can use the information from the background to improve tracking performance, and can adapt to the temporal changes of the target's appearance.

The outline of this chapter is as follows. In Section 3.1, a histogram-based appearance model is presented and a new measure of presence which uses the background information is developed. In Section 3.2, the probabilistic model of the proposed measure of presence is derived. In Section 3.3 we propose a principle to harvest additional information from the background to mask out pixels that do not belong to the target. In Section 3.4 we propose and discuss a scheme for the adaptation of the visual model to the changes in the target's appearance. A color-based probabilistic tracker is presented in Section 3.5, and the results of the experiments with the tracker are presented in Section 3.6. The chapter is summarized in Section 3.7.

¹The term background clutter is used throughout this thesis to refer to the visually-similar background pixels near the tracked object. For example, when the target's texture is very similar to the texture of the background, we say that the background clutter is severe.

Figure 3.1: The images (a, b, c) show persons in different poses with their torsos approximated by an ellipse. Image (d) shows an ellipse approximating a palm.

3.1 Color histograms

Color histograms encode the color statistics of a given region containing the target by constructing a non-parametric model of the color distribution within that region.
As we have seen in the related work (Section 1.1.1), various models that encode the outline of the target have been proposed in the literature and could be used for encoding the region of interest. However, these models usually have to be built for a specific class of objects and do not generalize well to other objects which may be of different shapes. One drawback of the shape models is that they require the tracked objects to be sufficiently large in the images so that enough edge information can be obtained. Another drawback is that they are prone to failure when the input images are noisy and the tracked object changes its shape rapidly. We thus consider an ellipse as a less detailed model for encoding the region containing the object, which can approximate fairly well various orientations of the human body, as well as body parts such as palms; see, for example, Figure 3.1. In our application, the state of the tracked object $x_k$ is thus parameterized by an ellipse $x_k = (x_k, y_k, a_k, b_k)$ with its center at $(x_k, y_k)$, and with parameters $a_k$ and $b_k$ denoting its width and height, respectively.

When constructing the color histogram, it is beneficial to assign higher weights to the pixels that are closer to the center of the ellipse and lower weights to those farther away. This can help achieve some robustness in the appearance model, since the pixels that are closer to the center are less likely to be affected by the clutter stemming from the background pixels or the nearby objects. Furthermore, if some a-priori knowledge of which pixels are not likely to belong to the target is available, it should be used in the construction of the histogram, i.e., those pixels should be ignored. An elegant way of discarding the pixels that do not belong to the target is to use a mask function, which assigns a prior weight of zero to those pixels that are not likely to have been generated by the target and a weight of one to all the other pixels.
Let $E$ be the elliptical region corresponding to some state $x = (x, y, a, b)$. The RGB color histogram with $B = 8 \times 8 \times 8$ bins, $h_x = \{h_i\}_{i=1}^{B}$, sampled within the ellipse $E$, is then defined as

$h_i = f_h \sum_{u \in E} K(u)\,M(u)\,\delta_i(b(u))$,  (3.1)

where $u = (x, y)$ denotes a pixel within the elliptical region $E$, $\delta_i(\cdot)$ is the Kronecker delta function positioned at histogram bin $i$, and $b(u) \in \{1, \dots, B\}$ denotes the histogram bin index associated with the color of the pixel at location $u$. $K(\cdot)$ is an Epanechnikov weighting kernel [165], as in [35, 118], positioned at the center of the ellipse, $M(u)$ is the a-priori binary mask function, and $f_h$ is a normalizing constant such that $\sum_{i=1}^{B} h_i = 1$. For an illustration of the weighting kernel, the mask function and the principle of sampling a histogram, see Figure 3.2.

3.1.1 Color-based measure of presence

To localize an object during tracking, we require a measure of presence which provides a belief that an object with some reference histogram $h_k$ is located at a given state. When some additional information about the color of the background is available, it should be used in the evaluation of this belief. Indeed, a body of literature exists on modelling the background using more or less complicated statistical models and on how to use these to discern the target from the background. However, to be more general, we assume that only a simple model is available, e.g., an image of the background without objects.

Figure 3.2: The target's state is modelled as an elliptical region (a). The weighting kernel and its parameters are sketched in (b). The mask function which masks out the background pixels is shown in (c). The target histogram is sampled by considering only pixels which are visible through the mask function and assigning a weight to each pixel according to the weighting kernel (d).
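The histogram sampling of (3.1) can be sketched directly. The synthetic image, the all-ones mask, the linear bin-index convention and the unnormalized kernel below are illustrative assumptions of this sketch; only the 8x8x8 bin layout and the Epanechnikov weighting follow the text.

```python
import numpy as np

def sample_histogram(img, mask, cx, cy, a, b, bins=8):
    """Masked, kernel-weighted RGB histogram (3.1) inside an axis-aligned ellipse."""
    h = np.zeros(bins ** 3)
    H, W, _ = img.shape
    for v in range(H):
        for u in range(W):
            d2 = ((u - cx) / a) ** 2 + ((v - cy) / b) ** 2
            if d2 > 1.0 or mask[v, u] == 0:   # outside the ellipse or masked out
                continue
            k = 1.0 - d2                      # Epanechnikov kernel, up to scale
            r, g, bl = (img[v, u] // (256 // bins)).astype(int)
            h[(r * bins + g) * bins + bl] += k   # b(u): linear bin index
    s = h.sum()
    return h / s if s > 0 else h              # f_h normalization

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(40, 40, 3))  # synthetic stand-in image
mask = np.ones((40, 40), dtype=int)           # M(u) = 1 everywhere here
hist = sample_histogram(img, mask, cx=20, cy=20, a=10, b=8)
```

A real mask would zero out background-like pixels before sampling, so they contribute nothing to the histogram.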
We thus define the measure of presence, which evaluates whether a target with a predefined reference histogram $h_k$ is present at some state $x_k$, as

$D(h_A, h_k; h_B) = \beta^{-1}\,\rho(h_A, h_k; h_B)$,  (3.2)

where $h_A$ and $h_B$ are histograms sampled at the state $x_k$ on the current image and on the precalculated background image, respectively. $\beta$ is the ratio between the number of pixels within the elliptical region of $x_k$ that are assigned to the foreground by the mask function $M(u)$ and the number of those assigned to the background. $\rho(h_A, h_k; h_B)$ is the normalized distance between $h_A$ and $h_k$, given the background histogram $h_B$, defined as

$\rho(h_A, h_k; h_B) = \frac{d(h_A, h_k)}{\sqrt{d(h_B, h_k)^2 + d(h_A, h_k)^2}}$,  (3.3)

where $d(h_1, h_2) = (1 - \sum_{i=1}^{B}\sqrt{h_{1i}\,h_{2i}})^{1/2}$ is the Hellinger distance [11, 137, 154]. Note that the Hellinger distance is related to the well-known Bhattacharyya coefficient [74]; it is a metric and has a geometrical interpretation. For a detailed discussion of this distance please see [11, 154].

Figure 3.3: The histogram of a basketball player was sampled within the ellipse (a). The non-normalized distance $d(h_A, h_k)$ calculated at different positions around the player is shown in (b). The result for the proposed normalized distance measure $\rho(h_A, h_k; h_B)$ is shown in (c). For better visualization, one minus the distance measures is shown. The correct position of the player is depicted by a white arrow and a circle in each image. Notice how the mode corresponding to the selected player is more pronounced with respect to the background clutter when the normalized distance is used (c).

The normalization term in (3.3) incorporates the distance between the reference color model and the background color. Such a normalization aids tracking when the target's color is similar to the background. In these situations the measure (3.3) favors those regions for which the reference color model is closer to the color in the current image than to the background color.
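The Hellinger distance and the background-normalized distance (3.3) built from it can be sketched as follows; the helper names and the toy histograms are illustrative.

```python
import numpy as np

def hellinger(h1, h2):
    # d(h1, h2) = sqrt(1 - sum_i sqrt(h1_i * h2_i)) for normalized histograms;
    # the max() guards against tiny negative values from floating-point error.
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(h1 * h2))))

def normalized_distance(h_A, h_k, h_B):
    # rho of (3.3): the candidate-to-reference distance, normalized by how far
    # the background is from the reference model.
    d_A = hellinger(h_A, h_k)
    d_B = hellinger(h_B, h_k)
    return d_A / np.sqrt(d_A ** 2 + d_B ** 2)

h_ref = np.array([0.5, 0.5, 0.0, 0.0])    # toy reference model
h_bg = np.array([0.0, 0.0, 0.5, 0.5])     # toy background histogram
exact_match = normalized_distance(h_ref, h_ref, h_bg)
bg_like = normalized_distance(h_bg, h_ref, h_bg)
```

When the candidate matches the reference exactly the measure is 0, and when the candidate looks exactly like the background it rises to $1/\sqrt{2}$, which is how the normalization suppresses background-colored regions.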
In practice, when using a particle filter, this effectively attenuates the background clutter and forces particles closer to the target. An example of the normalized and non-normalized distance measures is shown in Fig. 3.3.

3.2 The likelihood function

To carry out the update step of the particle filter (Algorithm 2.4), we require the probability density function (pdf) of the presence measure (3.2). Due to the lack of a rigorous physical background from which the analytical form of this pdf could be derived, an experimental approach was chosen instead. The pdf of (3.2) was estimated from a large number of examples of moving persons; see Fig. 3.4a for examples. Some of these persons were tracked using a simple tracker from the literature [118]. In cases when the simple tracker failed, we resorted to manual marking. This enabled us to obtain approximately 115,000 values of the measure (3.2), which are visualized by the histogram in Fig. 3.4b.

Figure 3.4: The left-hand image shows examples of persons which were used to estimate the empirical probability density function of the measure (3.2). The right-hand image shows this function in the form of a histogram, with the maximum-likelihood fitted gamma probability density function overlaid.

To identify the best model for the gathered data, model selection was carried out using the Akaike information criterion (AIC) [3] among four models of the probability density function: exponential, gamma, inverse gamma and zero-mean Gaussian². The test with the AIC showed that the gamma function explained the data significantly better than the other functions. For this reason the probability density function of the measure (3.2) was chosen in the form of

p(y_k | x_k) ∝ D(h_A, h_k; h_B)^{γ₁−1} e^{−D(h_A, h_k; h_B)/γ₂}.   (3.4)

The parameters γ₁ and γ₂ were estimated from the data using the maximum-likelihood approach. The estimated values were γ₁ = 1.769 and γ₂ = 0.066.
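Evaluating the gamma-shaped likelihood (3.4) up to the proportionality constant is a one-liner; the sketch below (hypothetical helper name `color_likelihood`) uses the reported maximum-likelihood parameters. The normalization constant is irrelevant for particle weighting, since the weights are renormalized anyway.

```python
import numpy as np

# Maximum-likelihood parameters reported for (3.4)
GAMMA_1, GAMMA_2 = 1.769, 0.066

def color_likelihood(d):
    """Unnormalized gamma-shaped likelihood of the presence measure d = D(.)."""
    d = np.asarray(d, dtype=float)
    return d ** (GAMMA_1 - 1.0) * np.exp(-d / GAMMA_2)
```

Since γ₁ > 1, the mode lies at (γ₁ − 1)γ₂ ≈ 0.05 rather than at zero, which reflects the observation below that a moving nonrigid target is more likely to change its appearance slightly between frames than to keep it identical.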
For more details on the model-selection results, please see Appendix A.

²Only the main results of the model selection using the Akaike information criterion are reported here; the reader is referred to Appendix A for more details.

Note that the gamma distribution assigns small probability values to those values of the measure (3.2) that are very close to zero. At first glance this may not seem reasonable for the purposes of object localization; however, if we observe a moving nonrigid object such as a person in two consecutive time-steps, it is more likely that the person's appearance will change within these two time-steps than stay the same. This is an inherent property of the visual dynamics of nonrigid objects and is implicitly captured by the likelihood function (3.4).

3.3 The background mask function

While color histograms are powerful color cues for tracking textured objects, they can fail when the object is moving on a similarly textured background. This is usually due to their inability to capture the spatial relations in the texture, and to the fact that they are always sub-sampled in order to increase their robustness. There is, however, still some useful information left in the current and the background image — the difference between the two. By thresholding this difference image with an appropriate threshold κ_k, we can construct a mask image that filters out those pixels which are likely to belong to the background. Since, in general, the illumination of the observed scene is non-uniform in space and time, the threshold has to be estimated dynamically for the tracked object. We base our method for mask generation on the following observation. When an object described by an ellipse is located on a visually similar background, some small portion of the pixels within the ellipse either come from the background, or come from the object but are very similar to the background.
We thus assume that in those situations we can assign some percentage η₀ of the pixels within the object's ellipse to the background. Let x_k denote the estimated state of the target at time-step k, and let h_A and h_B be the histograms sampled at that state on the current image A(·) and the background image B(·), respectively. The current mask function is defined as

M_D(u) = 1  if ||A(u) − B(u)|| > κ_k ;  0 otherwise,   (3.5)

where u is a pixel, || · || is the L2 norm and κ_k is the threshold in the current time-step.

Note that we have to generate the mask function only when the tracked object is similar enough to the background. Thus, in practice, we verify after each tracking iteration the similarity between the target's visual model and the background. If this similarity is within a predefined bound (g(h_A, h_B) < g_thresh), then the mask is generated in the next time-step. The threshold κ_{k+1} for the next time-step is estimated as the threshold that would, in the current time-step, produce a mask function such that a predefined percentage η₀ of the pixels within the ellipse of the current state x_k would be assigned to the background. Otherwise, if g(h_A, h_B) ≥ g_thresh, the mask is not generated in the next time-step and M_D(u) = 1 in (3.5) for all pixels u.

3.3.1 Implementation of dynamic threshold estimation

The procedure for calculating the appropriate κ_{k+1} is as follows. Let d_E = {d_{iE}}_{i=1}^T be the set of pixel-wise intensity differences between the current image A(·) and the background image B(·), calculated within the ellipse E of the estimated state x_k, and let these differences be ordered in ascending order. We define the cumulative function corresponding to the ordered differences as

c_{lE} = (1/T) Σ_{i=1}^{l} 1 = l/T,   (3.6)

i.e., the fraction of the ordered differences not larger than d_{lE}. The smallest l at which c_{(l+1)E} exceeds η₀ then gives the required threshold κ_{k+1} = κ_k(η₀), or formally

κ_k(η₀) = d_{lE} ;  c_{lE} ≤ η₀ < c_{(l+1)E}.
(3.7)

The parameters η₀ and g_thresh were estimated empirically by manually selecting persons as they moved on a heavily cluttered background. We observed that, when a person is located on a cluttered background, usually at least 25 percent of the pixels within the ellipse describing that person can be assigned to the background, and thus we chose η₀ = 25%. The parameter which decides when to generate the mask was set in a similarly empirical manner, to g_thresh = 0.8. The procedure of threshold approximation is illustrated by Example 2.

Example 2. Consider a situation where we observe a yellow person on a yellow background (Figure 3.5a). Assume that at time-step k − 1 we know the state of that person, described by the ellipse E (Figure 3.5a), and that the background image is available (Figure 3.5b). We want to determine a threshold κ_k(η₀) such that 25 percent of the pixels within the ellipse will be masked out by the resulting mask function. We thus first calculate the ordered set of intensity differences between the current image and the background within the ellipse E, in the form of a histogram of differences (Figure 3.5c). The corresponding cumulative function is shown in Figure 3.5d. From the cumulative function we read off the difference value at which the function exceeds η₀ = 0.25, which gives κ_k(0.25) = 55. For illustration, the mask function calculated from the current image and the background image is shown in Figure 3.5e, and the masked image of the person is shown in Figure 3.5f.

3.4 Adaptation of the visual model

When a nonrigid object such as a human body moves through an observed scene, its texture varies due to the non-uniform lighting conditions, influences of the background, and variations of the person's pose. Therefore, the color model, i.e., the person's current reference histogram h_k, has to be able to adapt to these changes.
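Before turning to the adaptation, note that the quantile-based threshold estimation of Section 3.3.1 (illustrated by Example 2) can be sketched in a few lines. This is a simplified stand-in: reading the threshold off the cumulative function of the ordered differences is equivalent to taking the η₀-quantile of the absolute differences, and `estimate_threshold` is a hypothetical helper name.

```python
import numpy as np

def estimate_threshold(curr_patch, bg_patch, eta0=0.25):
    """Pick the difference threshold kappa so that a fraction eta0 of the
    pixels in the region has |A(u) - B(u)| below it, i.e., is assigned to
    the background by the mask function (3.5)."""
    d = np.sort(np.abs(curr_patch.astype(float)
                       - bg_patch.astype(float)).ravel())
    l = int(np.floor(eta0 * d.size))   # index where the cumulative
                                       # fraction reaches eta0
    return d[l]
```

In the tracker this threshold is computed at the estimated state in the current frame and used to generate the mask in the next one.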
In addition, if the current state of the tracked person is likely to have been falsely estimated, so that the corresponding ellipse does not fully contain the person, then the reference histogram should be updated by a very small amount, or not at all. Otherwise, it should be adapted by some larger amount. Let x_k be the estimated state of a person at the current time-step, and let the histograms h_A and h_B be sampled at that state on the current and the background image, respectively. The adaptation then follows a simple auto-regressive form

h_k = α_k h_A + (1 − α_k) h_{k−1},   (3.8)

where h_{k−1} is the reference histogram from the previous time-step. The intensity of the adaptation is defined with respect to the normalized distance (3.3) between h_A and h_{k−1} as

α_k = α_max (1.0 − ρ(h_A, h_{k−1}; h_B)),   (3.9)

where α_max denotes the maximal adaptation. In all experiments in this thesis we use α_max = 0.05. This means that, at each time-step, we allow at most 5% adaptation of the reference histogram. The following example was designed to provide some insight into the choice of the maximal adaptation value. For better illustration we will assume that the parameter α_k is constant and equal to the maximal adaptation, i.e., α_k = 0.05.

Figure 3.5: Example of estimating the threshold κ_k at η₀ = 0.25. A person denoted by an ellipse is shown in (a) and the background image is shown in (b). The histogram of intensity differences is shown in (c) and the resulting cumulative function is shown in (d). The threshold corresponding to η₀ = 0.25 is found at d_{iE} = 55. The resulting mask function is shown in (e), and (f) shows the mask function superimposed over (a).

Example 3. Assume we observe a controlled environment with a single object illuminated by a white light.
At time-step k = 0 we record its reference histogram h_0 and then turn on colored lights such that the apparent color of the object changes. The new color histogram of the object is h_c and remains constant for all k > 0. The model h_k begins to adapt to the new histogram h_c according to (3.8), h_k = (1 − α_k) h_{k−1} + α_k h_c. Since we assume that α_k and h_c are constant, we can rewrite the reference histogram at time-step k as

h_k = (1 − α_k)^k h_0 + (1 − (1 − α_k)^k) h_c.   (3.10)

Now say we want to know what portion of the new histogram h_c has been incorporated into the current reference histogram h_k after the 25th time-step³ at a constant adaptation value α_k = 0.05. Using (3.10) we have h_25 = (1 − 0.05)²⁵ h_0 + (1 − (1 − 0.05)²⁵) h_c ≈ 0.28 h_0 + 0.72 h_c, which means that after 25 time-steps the reference histogram contains approximately one quarter of the initial reference histogram h_0 and three quarters of the new histogram h_c.

Further insight into the influence of different values of α_k is provided in Figure 3.6a. The graph shows the percentage of the histogram h_c in h_k after the 25th time-step for different values of α_k. We see that by setting α_k = 0.2, the reference histogram becomes practically equal to the current histogram after the 25th time-step. This means that at α_k = 0.2, and at a frame rate of 25 frames per second, the reference histogram completely adapts to an instant change within the time-span of one second. Figure 3.6b shows the percentage of the new histogram in h_k at a constant adaptation α_k = 0.05 with respect to time. We see that the reference histogram would completely adapt to an instant change of the person's appearance within 120 time-steps, which, at a frame rate of 25 frames per second, amounts to about 5 seconds.

³Note that, in most of the experiments reported in this thesis, 25 frames correspond to one second of video.
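The selective update (3.8)–(3.9) and the closed form (3.10) used in Example 3 can be sketched together; `adapt_histogram` and `new_histogram_fraction` are hypothetical helper names, and the distance helpers repeat the definitions from Section 3.1.1 so the sketch is self-contained.

```python
import numpy as np

def hellinger(h1, h2):
    bc = np.sum(np.sqrt(h1 * h2))            # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))

def normalized_distance(h_A, h_k, h_B):
    g_A, g_B = hellinger(h_A, h_k), hellinger(h_B, h_k)
    return g_A / np.sqrt(g_B ** 2 + g_A ** 2)

def adapt_histogram(h_prev, h_A, h_B, alpha_max=0.05):
    """Selective adaptation (3.8)-(3.9): the update strength shrinks as the
    sampled histogram h_A moves away from the reference h_prev."""
    alpha = alpha_max * (1.0 - normalized_distance(h_A, h_prev, h_B))
    return alpha * h_A + (1.0 - alpha) * h_prev

def new_histogram_fraction(alpha, k):
    """Closed form (3.10): fraction of the new appearance h_c absorbed into
    the reference histogram after k steps at constant adaptation alpha."""
    return 1.0 - (1.0 - alpha) ** k
```

Since (3.8) is a convex combination, the updated reference histogram remains normalized whenever its inputs are.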
Figure 3.6: Percentage of the new histogram in the reference histogram after 25 time-steps, with respect to a constant adaptation parameter α (a). Percentage of the new histogram in the reference histogram at α = 0.05, with respect to the number of time-steps (b).

3.5 Color-based probabilistic tracker

In the previous sections we presented an adaptive color-histogram-based visual model of the target, which is able to harvest information about the color of the background with the aim of better discriminating the target from the background. In this section we show how the proposed color model can be used for tracking within the framework of particle filters.

Prior to tracking, we first calculate the background image. This can be achieved, for example, either by taking a single image of an empty scene, or by recording a sequence of images and then constructing a median image pixel-wise along the temporal component. The tracker is initialized by selecting the target and recording the reference histogram; this can be done, for example, by manually clicking on the target. At each tracking iteration the following steps are then executed. First, a mask function is calculated using the dynamically estimated threshold from the previous time-step, as discussed in Section 3.3. Then an iteration of the bootstrap particle filter (Algorithm 2.4) is executed using the likelihood function derived in Section 3.2. The current estimate x_k of the target state (position and size of the ellipse) is calculated as the MMSE estimate (2.32). A histogram is then sampled at the estimated state x_k and used to adapt the reference histogram according to Section 3.4.
Finally, in the last step of the tracking iteration, the threshold for generating the mask image in the next time-step is calculated, if necessary, according to Section 3.3. This procedure is summarized in Algorithm 3.1.

Initialize:
• Calculate the background image, e.g., pixel-wise by means of a median filter along the temporal axis.
• Initialize the tracker by selecting the target (e.g., manually).
Tracking: for each time-step k,
• Calculate the mask function according to Section 3.3.
• Execute an iteration of the bootstrap particle filter (Algorithm 2.4) using the likelihood function from Section 3.2.
• Estimate the current state x_k by the MMSE estimate (2.32) from the particle filter.
• Sample the histogram at x_k and adapt the model to that histogram as in Section 3.4.
• If required, estimate the threshold for the mask function M_D(u) in the next time-step (Section 3.3).

Algorithm 3.1: Color-based probabilistic tracker.

Note that the proposed color model and the probabilistic tracker in Algorithm 3.1 require setting some parameters: the parameters of the likelihood function, of the dynamic background subtraction, and of the adaptation of the reference histogram. In the previous sections we have discussed how these parameters were selected, and in all the following experiments in this thesis their values are kept constant. For a better overview, we summarize them in Table 3.1.

Table 3.1: Parameters of the color-based probabilistic tracker.
• Parameters of the color-based likelihood function (3.4): (γ₁, γ₂) = (1.769, 0.066)
• Parameters for dynamic background subtraction (Section 3.3.1): (η₀, g_thresh) = (0.25, 0.8)
• Maximal adaptation of the color model (3.9): α_max = 0.05

3.6 Experiments

A set of experiments was performed to evaluate how the proposed likelihood function (Sect. 3.2), the adaptation scheme (Sect. 3.4) and the dynamic background subtraction (Sect. 3.3) influence the performance of tracking persons in images.
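To make the per-iteration structure of Algorithm 3.1 concrete before turning to the experiments, here is a deliberately simplified bootstrap particle-filter iteration over a one-dimensional state. It is a generic sketch, not the thesis tracker: the state is a scalar rather than an ellipse, a random walk stands in for the NCV dynamic model, and `bootstrap_iteration` and `systematic_resample` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

def systematic_resample(particles, weights):
    """Systematic resampling: draw N particles proportionally to weights."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[np.clip(idx, 0, n - 1)]

def bootstrap_iteration(particles, likelihood, process_noise=1.0):
    """One bootstrap-filter iteration: predict, weight, estimate, resample."""
    # predict: diffuse particles (random-walk stand-in for the NCV model)
    particles = particles + rng.normal(0.0, process_noise,
                                       size=particles.shape)
    # update: weight each particle by the likelihood of the observation
    weights = likelihood(particles)
    weights = weights / weights.sum()
    # MMSE estimate: the posterior mean, cf. (2.32)
    estimate = np.sum(weights * particles)
    return systematic_resample(particles, weights), estimate
```

In the actual tracker the likelihood argument would be the color-based likelihood (3.4) evaluated at each particle's ellipse, with the mask and adaptation steps wrapped around this core.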
In these experiments we compared the proposed color-based tracker from Algorithm 3.1, denoted by Tcol, to a reference tracker Tref from the literature [118]. The reference tracker Tref was also a color-histogram-based particle filter; however, it did not account for the background in the likelihood function and in the adaptation, and did not use the dynamic background subtraction scheme. Both trackers used 50 particles in the particle filter. The target motion was modelled by two independent nearly-constant-velocity (NCV) models in the horizontal and vertical direction. The dynamics of the target's size was modelled by two independent random-walk (RW) models in the horizontal and vertical direction. For a detailed description of the NCV and RW models, and the assumptions they impose on the dynamics, see Appendix B.1 and Appendix B.2, respectively. The parameters of the NCV models were set for each set of experiments separately, according to the expected size of the tracked object in the video. The parameters of the RW models were set such that the target's size would change by at most approximately 15 percent between consecutive time-steps.

Three experiments were considered for comparing the performance of Tcol and Tref. The first experiment considered tracking a person in a heavily cluttered environment from a bird's-eye view. The second experiment considered tracking a person from a side view in a less cluttered environment, in a situation where that person is occluded by another, visually similar, person. The third experiment considered tracking a person from a side-view moving camera. In each experiment, a single person was manually initialized and tracked throughout the sequence. If the person was lost during tracking, the tracker was manually reinitialized and tracking continued. In all experiments, the background image was calculated prior to tracking.
This was done by calculating the mean value of each pixel along its temporal axis over the entire sequence of images.

Experiment 1: Tracking in severe clutter

The first experiment considered tracking a 12 × 12 pixels large goal-keeper on a heavily cluttered background in a 773-image sequence of a handball match (Figure 3.7a). On average, the tracker Tcol required only one user intervention during tracking, while the tracker Tref required approximately 14 interventions to maintain a successful track throughout the sequence. Note that the goal-keeper was visually very similar to the goal area; nevertheless, Tcol was still able to maintain a good track. The only time that Tcol failed was when the goal-keeper was located in front of the goal-pole (Figure 3.7c). At that point he was simply too similar to the background and the tracker drifted to his legs, which was considered a loss of track.

Figure 3.7: A blue goalkeeper on a heavily cluttered background (a) is depicted by a white arrow. The background image is shown in (b). The single situation where Tcol failed by drifting to the goal-keeper's legs is shown in (c); the white ellipse depicts the falsely estimated position.

Experiment 2: Tracking with occlusion

The second experiment was conducted on a 600-image sequence taken from the PETS 2006 [127] surveillance scenario (see Figure 3.8). This experiment considered tracking a person dressed in black, who walked from the left part of the scene to the right, stopped, waited there for a while, and then walked back to the left. At the end of the sequence, the person was occluded once by another, visually similar, person walking in the opposite direction (Figure 3.9a, middle column).
On average, the reference tracker Tref required approximately 5 user interventions during tracking when the person was standing still on a dark background (e.g., Figure 3.8b), and another intervention when the person was occluded by the other person; this situation is shown in Figure 3.9b. On the other hand, the proposed tracker Tcol was able to track the person throughout almost the entire sequence. However, when the person got occluded by another person of similar color, the measurements became too ambiguous and tracking failed (see, for example, Figure 3.9c).

Figure 3.8: Examples of the surveillance video used in Experiment 2. In images (a,b) the tracked person is depicted by a white arrow. Image (c) shows the approximation of the background image, which was used in the proposed tracker Tcol.

Experiment 3: Tracking from a moving camera

The third experiment considered tracking the upper body of a handball player in a 700-image sequence, which was recorded with a moving camera (see, for example, Figure 3.10). The tracked player was dressed in yellow and was thus visually similar to some parts of the court. Figure 3.10c shows the approximation to the background image that was used by the proposed tracker Tcol. Note that, since the camera was not static, this image is not the actual background image. However, it still captures some salient visual properties, such as the color of the court (lower part of Figure 3.10c) and the color of the tribune (upper part of Figure 3.10c).

Figure 3.9: Images show frames 627, 638 and 664 from the surveillance video used in Experiment 2, in which a person is occluded by another, visually similar person (a). Results of tracking using the reference tracker Tref and the proposed tracker Tcol are shown in (b) and (c), respectively, with the estimated locations of persons depicted by white ellipses.
The proposed tracker Tcol was able to use the information provided by the approximation of the background to avoid losing the player and drifting to the floor. Figure 3.11a shows such a situation, where the tracked player was moving partly on the yellow part of the court. While Tcol was able to keep track of the player (Figure 3.11c), Tref failed (Figure 3.11b). In addition to the moving camera, the tracked player was on several occasions occluded by other, differently colored, players. In most cases this caused a failure of Tref, while, again, Tcol was still able to track the player through the occlusion. One such situation is shown in Figure 3.12. On average, the tracker Tref required approximately 8 interventions to maintain a successful track. On the other hand, Tcol comfortably tracked the player throughout the entire recording.

Figure 3.10: Examples of the video recording used for tracking from a moving camera (Experiment 3). Images (a,b) show the yellow player (depicted by a white arrow) which was tracked. Image (c) shows the approximation of the background image used by the proposed tracker.

3.7 Conclusion

In this chapter we have proposed several improvements of the color-based tracker. The first was the color-based measure of the target's presence (Section 3.1.1), which uses information from the approximation of the background image to reduce the influence of background clutter. Using a model-selection methodology and maximum-likelihood estimation, we have proposed a likelihood function (Section 3.2) which can be used to probabilistically interpret values of the proposed presence measure. However, in cases when the target is moving over those parts of the background that are very similar in color to the target, the proposed measure of presence may not be discriminative enough.
For that reason we have considered background subtraction, i.e., generating a mask function with the aim of masking out pixels in the current image that do not belong to the target. In situations where the lighting of the scene is changing, or the camera is moving or shaking, it is usually difficult to obtain an accurate model of the background. We have therefore considered using only a simple approximation to the background and proposed a dynamic background subtraction (Section 3.3), which is our second improvement of the color-based tracker. The mask function is generated by evaluating the similarity between the tracked target and the background model and is thus specialized to the tracked target. The third improvement was the selective adaptation of the target's visual model (Section 3.4), which guards against updating the color-based visual model in situations where the position of the target is falsely estimated, or when the target is occluded by another object. We have also shown how these improvements are probabilistically combined within the framework of particle filters into a color-based probabilistic tracker.

Figure 3.11: Images show frames 347, 356, 364 and 378 from a video in which the yellow player moves on a yellow background (a). The reference tracker drifted to the floor and tracking failed (b), while the proposed tracker still maintained a successful track (c). The estimated player's location is depicted in each image by a white ellipse.

Experiments were conducted with tracking people from different views using static and moving cameras. The proposed tracker was compared to a reference tracker that was conceptually similar to our own, but did not utilize the proposed improvements.
The results showed that the proposed tracker yielded more stable tracking, requiring significantly fewer user interventions than the reference tracker in situations when the target was moving on a heavily cluttered background. The experiment of tracking a person from a moving camera showed that the proposed tracker outperformed the reference tracker even though the background image could not be accurately estimated. While the proposed tracker was able to cope with short-term occlusions when the tracked person was occluded by a differently colored person, its performance degraded in situations where the occluding person was visually similar. In the next chapter we propose a solution which can help resolve such situations.

Figure 3.12: Images show frames 589, 618, 633 and 645 from a video where the yellow player is occluded by other, differently colored, players. During the occlusion the reference tracker failed to track the yellow player (b), while the proposed tracker maintained a successful track (c). The tracker-estimated player's location is depicted in each image by a white ellipse.

We have gotten stuck half-way in our transition from the planned and command economy to a normal market economy. We have created... a hybrid of the two systems.
Boris N. Yeltsin (1931 - 2007)

Chapter 4

Combined visual model

The color-based probabilistic visual model proposed in the previous chapter can significantly increase tracking performance in cases when the color of the background is similar to the color of the tracked object. However, an inherent drawback of color-based (or, for that matter, edge-based) models is that tracking may fail whenever the target gets into close proximity of a visually similar object. We observed one such situation in the second experiment of the previous chapter, where we were tracking a person dressed in black through an occlusion by another person dressed in black.
In that situation, the visual information provided by the color-based visual model became too ambiguous, and after the occlusion was over, the tracker continued to track the wrong person. In applications such as video surveillance, visual human-computer interfaces and tracking in sports, the camera is sometimes positioned such that the scene is viewed only from the side. In these situations, complete occlusions between visually similar objects are quite frequent, which is bound to deteriorate the performance of any color-based tracker.

In this chapter we argue that ambiguities which arise from the visual similarity between different objects can be resolved, to some extent, if we account not only for the target's color but also for a low-level temporal component of the visual information – the optical flow. We illustrate this argument in Figure 4.1 with an example of tracking a person's right hand after it has been occluded by the left hand. Figure 4.1d shows the likelihood function corresponding to the reference color model of the hand. Observe that the mode of the likelihood function stretches over both hands. This causes a persistent ambiguity in the hand's position, and it is very likely that tracking will fail. Now assume that we calculate the optical flow in the image (Figure 4.1b) and that we know the tracked hand is moving to the left. If we visualize the likelihood function reflecting which flow vectors in Figure 4.1b point to the left, we obtain the local-motion likelihood function in Figure 4.1e. Note that, while one of the modes of this function corresponds to the tracked hand, the other hand is hidden, since it is not moving to the left. The local-motion likelihood function also assigns significant likelihood to some other parts of the image which do not correspond to the tracked hand, and so on its own it too introduces some ambiguity in the position of the tracked hand.
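The complementary nature of the two cues can be checked with a toy numeric sketch: two one-dimensional likelihoods that are individually ambiguous, but whose product is not. All numbers below are invented purely for illustration.

```python
import numpy as np

# Toy 1D likelihoods over candidate positions 0..9. The color likelihood
# has modes at both hands (positions 2 and 7); the local-motion likelihood
# supports only regions moving to the left (here around position 7, plus
# some clutter elsewhere).
color_lik  = np.array([.05, .1, .9, .1, .05, .05, .1, .9, .1, .05])
motion_lik = np.array([.05, .05, .05, .05, .3, .2, .6, .8, .3, .1])

# Combining the two cues by a simple product: each cue suppresses the
# other's spurious modes, leaving a single mode at the tracked hand.
combined = color_lik * motion_lik
```

The probabilistic justification for using a product, namely the conditional independence of color and motion given the state, is developed later in this chapter.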
But if we combine (multiply) the color likelihood with the local-motion likelihood, we get the combined likelihood function shown in Figure 4.1f. Note that now only the mode corresponding to the tracked hand remains, and that the maximum corresponds to the hand's location (Figure 4.1c) – we have thus successfully resolved the ambiguity of the hand's position.

The outline of this chapter is as follows. In Section 4.1 we discuss a method for estimating the optical flow. Based on this method we define the local-motion feature in Section 4.2, where we also derive a probabilistic model for the local motion. The combined, color- and motion-based probabilistic tracker is derived in Section 4.3, and results of experiments are reported in Section 4.4. We summarize this chapter in Section 4.5.

4.1 Optical flow

When an object moves through an observed environment, its motion is perceived by the camera as changes in the intensity of points in 3D space. If we assign a velocity vector to each of those 3D points, we obtain the so-called motion field, which tells us how each point in the scene translates with respect to the camera (see [84], pages 278–285). An image is generated in the camera through a projection of the light rays of the 3D points onto the CCD sensor. While one source of changes in the intensity of the 3D points is the motion field, the other is the changing illumination of the scene. These two sources induce temporal patterns of intensities at the pixels on the CCD sensor. The induced patterns are called the apparent motion, and an estimate of the apparent motion calculated from a sequence of images is called the optical flow.

A widely used hypothesis for calculating the optical flow is that the brightness of a local structure (e.g., a pixel on a moving object) remains constant over

Figure 4.1: A person's hand (depicted by the ellipse) is tracked after being occluded by the other hand (a).
The local motion, the optical flow, induced by both hands is depicted in (b). The color likelihood and the local-motion likelihood are shown in (d) and (e), respectively. Bright colors correspond to high likelihood, while dark colors correspond to low likelihood. Image (f) shows the combined, color-and-motion, likelihood from (d) and (e). The position corresponding to the maximum likelihood in (f) is depicted by a white cross in (c).

consecutive images, while its location may change [16]. This is known as the data conservation constraint. However, since calculation of the flow vector from a single pixel is ill-posed, additional constraints are required. Alternatively, one can assume data conservation within a patch rather than at a single pixel and then solve for the optical-flow vectors by least squares [103]; this is the basis of the well-known Lucas-Kanade method. However, since a nonrobust least-squares estimator is used in the Lucas-Kanade method, the resulting optical flow is still poorly estimated in regions of homogeneous intensity, where the calculation is again ill-posed and very susceptible to noise. This drawback can be remedied to some extent by noting that the intensity patterns of nearby pixels in the image tend to move similarly. The constraint which assumes similar velocities in neighboring pixels is called the spatial coherence constraint. Horn and Schunck proposed a method for calculating the optical flow [58] which simultaneously applies the data conservation constraint and the spatial coherence constraint. This method produces a better optical flow than the Lucas-Kanade method, but at the cost of a significantly increased computation time.
In practice, however, both the data conservation and the spatial coherence constraints are violated in the presence of multiple motions, and since the underlying calculations are usually carried out using least squares, the estimated optical flows may still contain errors. Some researchers [16] therefore replace the nonrobust least squares with robust estimators to handle multiple motions, or explicitly model the discontinuities by generative models [17]. Usually, calculations of the flow vectors with respect to the various constraints involve iterative procedures and are thus computationally very demanding. For that reason, Zach et al. [173] have recently proposed efficient implementations that exploit features of specialized graphics cards to calculate the optical flow with the spatial coherence and data conservation constraints in real time. To illustrate how the constraints influence the calculation of the optical flow, we compare in Figure 4.2 the optical flows estimated using the Lucas-Kanade approach [103] and Black's [16] multiple-motion method. Note that while Black's method (Figure 4.2c) produces a much better optical flow than Lucas-Kanade (Figure 4.2b), the time for calculating Black's flow in Figure 4.2 was considerably longer than for the Lucas-Kanade flow. On a Celeron 1.5 GHz, 500 MB RAM laptop, a C++ implementation of Black's [18] method required twelve seconds to calculate the flows of a 288 × 360 pixel image pair, while Lucas-Kanade [122] required only one second.

Figure 4.2: Comparison between the optical flow calculated using the Lucas-Kanade algorithm and using M. J. Black's multiple-motion algorithm. The reference image is shown in (a), while the results of the Lucas-Kanade and Black's methods are shown in (b) and (c), respectively.
Due to the considerable computation time of the methods that involve multiple constraints, in the following we will focus on simple methods like the Lucas-Kanade algorithm to obtain the information about the local motions in the image. Thus, rather than calculating the dense optical flow (i.e., optical flow vectors at each pixel) using a time-consuming method, we will identify which points in the image contain enough local texture and calculate the flow vectors only at those points. We will thus be interested in calculating a sparse rather than dense flow, which will allow us to use computationally less demanding algorithms for flow calculation.

4.1.1 Calculating the sparse optical flow

There are two conceptual ways in which the optical flow at a given location in the image can be defined. One is to consider from which location in the previous image the pixel in the current image has been translated. The other is to consider to which location in the next image the pixel in the current image will be translated. In our implementations we consider the first definition of the optical flow. Thus the optical flow at a given point is calculated by estimating the optical flow vector from the current image back to the previous image and reversing its direction. This approach was chosen since it relates the current

Figure 4.3: Example of the optical flow calculation. Images (a) and (b) are consecutive images from a video. To calculate the flow in image (b), we first calculate the optical flow (c) at each point of the current image (b) back to the previous image (a). The direction of this flow is then reversed. The resulting flow overlaid over the current image (b) is shown in (d). For better visibility, only every fifth flow vector is shown.

image to the previous image and is in better agreement with the way in which we will use the optical flow for tracking.
An example of the described definition of the optical flow is shown in Figure 4.3. In our implementation, the optical flow is estimated using the pyramidal implementation [22] of the well-known Lucas-Kanade method. The pyramidal implementation starts by constructing a multi-resolution pyramid for a given pair of images, where at each level of the pyramid the images are resized to half of their size at the previous level. The optical flow vectors are initially estimated using the Lucas-Kanade method at the coarsest (lowest) level and used to initialize the calculation of the flow vectors at the next higher level. In this way, at each level, the flow-vector estimate obtained from the previous level is refined. This allows for efficient calculation of the flow vectors in the presence of small as well as large motions. As already discussed, the Lucas-Kanade calculation of the flow vector at a given pixel does not take into account the neighboring flow vectors, and the resulting flow field is usually noisy (see, e.g., Figure 4.2b and Figure 4.3d). Worse yet, this method fails to provide a reliable estimate of the flow vectors in regions with poor local texture. We therefore apply the Shi-Tomasi feature detector [143] to determine locations with sufficient local texture and calculate the optical flow only at those locations. The Shi-Tomasi feature at location (x, y) is defined by the smallest eigenvalue of the covariance matrix of the gray-scale intensity gradients, which are calculated in the neighborhood of (x, y). The location (x, y) is accepted as a valid Shi-Tomasi feature if the smallest eigenvalue exceeds a predefined threshold λ_th. An example of valid Shi-Tomasi features and the corresponding flow vectors is shown in Figure 4.4. Note that the majority of the flow vectors that correspond to the valid features in Figure 4.4e appear to be well estimated.
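As a concrete illustration of the two building blocks just described, the following minimal NumPy sketch computes the Shi-Tomasi response at a point and a single-window Lucas-Kanade flow vector using the backward-then-reverse definition adopted above. It is a single-level, single-point sketch (the actual implementation uses the pyramidal method of [22]); the function names and window sizes are ours.

```python
import numpy as np

def shi_tomasi_response(I, x, y, half=1):
    """Smallest eigenvalue of the covariance matrix of the grey-scale
    intensity gradients in a (2*half+1)^2 neighbourhood of (x, y)."""
    Iy, Ix = np.gradient(I)                       # derivatives along rows/cols
    gx = Ix[y - half:y + half + 1, x - half:x + half + 1].ravel()
    gy = Iy[y - half:y + half + 1, x - half:x + half + 1].ravel()
    M = np.array([[gx @ gx, gx @ gy],
                  [gx @ gy, gy @ gy]])
    return np.linalg.eigvalsh(M)[0]               # eigenvalues in ascending order

def lucas_kanade_flow(I_prev, I_cur, x, y, half=4):
    """Single-window Lucas-Kanade step implementing the definition from the
    text: estimate the displacement from the current image back to the
    previous one by least squares, then reverse its direction."""
    Iy, Ix = np.gradient(I_prev)
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    It = (I_cur - I_prev)[win].ravel()
    b, *_ = np.linalg.lstsq(A, It, rcond=None)    # backward displacement
    return -b                                     # forward flow [u_x, u_y]
```

On a synthetic Gaussian blob translated by one pixel, the recovered flow at a sufficiently textured point is close to (1, 0), while the Shi-Tomasi response separates textured from flat locations.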
Figure 4.4: Two consecutive images from a video of a squash match are shown in (a) and (b). The optical flow estimated for image (b) using the Lucas-Kanade method is shown in (c). The valid Shi-Tomasi features are depicted in white in image (d). The flow vectors from (c) that correspond to the valid Shi-Tomasi features in (d) are shown in (e). For clarity, only every third flow vector is shown.

Using the sparse optical flow defined above, we can now derive the local-motion feature, which will later be used to provide the motion information for a probabilistic tracker.

4.2 Optical-flow-based local-motion feature

Let v_k(x, y) = [r, φ] be the optical flow vector at location (x, y) in the current image, with amplitude r and orientation φ. Note that in the literature the optical flow vectors are usually written in Cartesian coordinates. The reason why we use the polar notation instead is that in the following we treat the angle of the optical flow separately from its amplitude. The local-motion feature v_E = [r_E, φ_E] of a region E is then encoded as the weighted average of the flow vectors

v_E = f_v^{-1} \sum_{(x,y) \in E'} v_k(x, y) K(x, y),   (4.1)

where E' ⊆ E is the set of detected Shi-Tomasi features within region E, K(x, y) is the Epanechnikov kernel [140] used to assign higher weights to those flow vectors that are closer to the center of E, and f_v = \sum_{(x,y) \in E'} K(x, y) is a normalization term such that f_v^{-1} \sum_{(x,y) \in E'} K(x, y) = 1. To avoid the pathological situations associated with vectors of zero amplitude, the summation in (4.1) is carried out in Cartesian coordinates.

4.2.1 Local-motion likelihood

Let v_ref = [r_ref, φ_ref] be a reference vector which models the target's local motion, and let v_E be a local-motion vector calculated within region E.
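Before turning to the likelihood, the feature-extraction step (4.1) can be sketched as follows. The flows are passed in Cartesian coordinates, as the text prescribes, and converted to polar form only at the end; the kernel profile and function names are illustrative assumptions of this sketch.

```python
import numpy as np

def epanechnikov(x, y, cx, cy, rx, ry):
    # Kernel profile max(0, 1 - d^2), with d the distance from the region
    # centre normalised by the region's half-axes (constant factor omitted).
    d2 = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2
    return np.maximum(0.0, 1.0 - d2)

def local_motion_feature(points, flows, cx, cy, rx, ry):
    """Eq. (4.1): kernel-weighted average of the sparse flow vectors of the
    valid features in the region, summed in Cartesian coordinates and
    returned in the polar form (r_E, phi_E)."""
    w = np.array([epanechnikov(x, y, cx, cy, rx, ry) for x, y in points])
    if w.sum() == 0:          # no valid Shi-Tomasi features: v_E is undefined
        return None
    v = (w[:, None] * np.asarray(flows, dtype=float)).sum(axis=0) / w.sum()
    return np.hypot(v[0], v[1]), np.arctan2(v[1], v[0])
```

Averaging in Cartesian coordinates avoids the pathological wrap-around of angles near zero-amplitude vectors, exactly as noted in the text.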
We define the angular and amplitude similarity measures G_φ and G_r, respectively, between v_ref and v_E as

G_φ(v_E, v_ref) = { ∠(v_E, v_ref)/π ; r_E > δ_th ∧ r_ref > δ_th
                  { 1 ; otherwise,   (4.2)

G_r(v_E, v_ref) = { |r_E − r_ref|/(r_E + r_ref) ; r_E > δ_th ∨ r_ref > δ_th
                  { 0 ; otherwise,   (4.3)

such that G_φ(·,·) ∈ [0, 1] and G_r(·,·) ∈ [0, 1], where ∠(·,·) denotes the angle between two vectors, |·| is the L1 norm, and δ_th is a threshold below which the vectors are considered as having zero amplitude. If region E contains no valid Shi-Tomasi features, the vector v_E is undefined and the similarities are set to G_φ = 1.0 and G_r = 1.0. We have observed in a preliminary study [88] that the probability density functions (pdf) of (4.2) and (4.3) can be well approximated by exponential distributions. However, in practice we approximate the current reference motion using the motions observed in previous time-steps. This may impair the quality of tracking whenever the target suddenly and significantly changes its motion. To cope with such events, we introduce a uniform component into the probability density function. The joint probability density function of (4.2) and (4.3) with parameters θ = [λ_φ, λ_r, w_noise] is then defined as

p(G_φ, G_r | θ) ∝ (1 − w_noise) e^{−(G_φ/λ_φ + G_r/λ_r)} + w_noise,   (4.4)

where λ_φ and λ_r are the parameters of the exponential distributions and 0 < w_noise < 1 is the weight of the uniform component.

4.2.2 Adaptation of the local-motion feature

After each tracking iteration, the current state x̂_k of the target and its current velocity v̂_k are calculated, e.g., via the MMSE estimate (2.32) from the particle filter. The new region E containing the target is determined and the local-motion vector v_{E_k} = [r_{E_k}, φ_{E_k}] (4.1) is estimated. If the region E contains at least one valid Shi-Tomasi feature, then v_{E_k} is used to adapt the reference local-motion model v_ref = [r_ref, φ_ref].
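A sketch of the measures (4.2, 4.3) and the likelihood (4.4) follows. The normalised absolute difference used for the amplitude term is an assumption of this sketch (recovered from the requirement that G_r ∈ [0, 1]), as is the threshold value; vectors are given in the polar form (r, φ).

```python
import numpy as np

DELTA_TH = 1e-2   # amplitude below which a vector counts as zero (assumed)

def g_phi(vE, vref):
    """Angular measure (4.2): angle between the vectors, scaled to [0, 1]."""
    (rE, phiE), (rr, phir) = vE, vref
    if rE > DELTA_TH and rr > DELTA_TH:
        d = abs(phiE - phir) % (2 * np.pi)
        return min(d, 2 * np.pi - d) / np.pi
    return 1.0

def g_r(vE, vref):
    """Amplitude measure (4.3); the normalised absolute difference is an
    assumption of this sketch."""
    (rE, _), (rr, _) = vE, vref
    if rE > DELTA_TH or rr > DELTA_TH:
        return abs(rE - rr) / (rE + rr)
    return 0.0

def motion_likelihood(gp, gr, lam_phi=0.1, lam_r=0.3, w_noise=0.01):
    """Eq. (4.4): exponential model with a uniform component that keeps a
    floor probability for abrupt motion changes (unnormalised)."""
    return (1.0 - w_noise) * np.exp(-(gp / lam_phi + gr / lam_r)) + w_noise
```

With the parameters from Section 4.4, two aligned vectors receive a much higher likelihood than two opposing ones, while the uniform component prevents the likelihood from ever vanishing.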
This is achieved by applying an autoregressive scheme

φ_ref^+ = (1 − β_{φk}) φ_ref^− + β_{φk} φ_{E_k},  r_ref^+ = (1 − β_{rk}) r_ref^− + β_{rk} r_{E_k},   (4.5)

where the superscripts (·)^− and (·)^+ denote the reference model before and after the adaptation, respectively. The variables β_{φk} and β_{rk} are the current adaptation intensities

β_{φk} ∝ p(G_φ(v̂_k, v_{E_k}), 0 | θ),  β_{rk} ∝ p(0, G_r(v̂_k, v_{E_k}) | θ),   (4.6)

where p(·,·|θ) is defined in (4.4), and β_{φk} ∈ [0, 1], β_{rk} ∈ [0, 1]. If the region E does not contain any valid Shi-Tomasi features, then v_{E_k} is undefined and the reference is not adapted. From (4.6) it follows that the reference local-motion model is adapted to the local changes in the target's motion only when the velocity with which the tracker predicts the target is moving is approximately in agreement with the observed local motion at the currently estimated state. Otherwise the adaptation is low, since the target is probably being occluded by another object.

4.3 The combined probabilistic visual model

We derive the combined color/local-motion-based visual model by extending the color-based visual model from Chapter 3 to account for the local motion. Under the assumption that the target's color properties are independent of its motion, the likelihood function for the particle filter can be written as

p(y_k | x_k) = p(y_k^{col} | x_k) p(y_k^{mot} | x_k),   (4.7)

where p(y_k^{col} | x_k) is the color likelihood at state x_k and p(y_k^{mot} | x_k) is the local-motion likelihood at that state. Note that, in the case of the purely color-based visual model from Chapter 3, T_col, the likelihood function is equal to p(y_k^{col} | x_k). The combined color/local-motion-based visual model, which we denote by T_cmb, is then obtained by replacing the likelihood function in T_col by (4.7) and setting

p(y_k^{mot} | x_k) = p(G_φ(v_{x_k}, v_ref), G_r(v_{x_k}, v_ref) | θ).   (4.8)

In the equation above, p(·,·|θ) is defined in (4.4), v_{x_k} is the local motion (4.1) sampled at state x_k, and v_ref is the reference local motion.
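The adaptation step (4.5, 4.6) and the likelihood factorization (4.7) can be sketched as below. This sketch assumes that the adaptation intensities weight the newly observed motion, so that strong agreement between the predicted and observed motion yields strong adaptation, as the verbal description implies, and that β is normalised by the likelihood's maximum (which is 1 at zero dissimilarity).

```python
import numpy as np

def adaptation_intensity(g, lam, w_noise=0.01):
    """Eq. (4.6): intensity proportional to the likelihood (4.4) of the
    observed dissimilarity g; the likelihood peaks at 1 for g = 0, so the
    result already lies in (0, 1]."""
    return (1.0 - w_noise) * np.exp(-g / lam) + w_noise

def adapt_reference(v_ref, v_E, beta_r, beta_phi):
    """Eq. (4.5): autoregressive blend of the reference motion (r, phi)
    with the observed local motion.  Assumption of this sketch: the
    intensities weight the observed motion, so agreement -> adaptation."""
    r_ref, phi_ref = v_ref
    r_E, phi_E = v_E
    return ((1.0 - beta_r) * r_ref + beta_r * r_E,
            (1.0 - beta_phi) * phi_ref + beta_phi * phi_E)

def combined_likelihood(p_col, p_mot):
    """Eq. (4.7): color and motion are treated as conditionally
    independent, so the combined likelihood is simply their product."""
    return p_col * p_mot
```

When the predicted and observed motions agree (dissimilarity near zero), the intensity approaches one and the reference snaps to the observation; during an occlusion the intensity drops toward the uniform floor and the reference is essentially frozen.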
While during tracking the color histograms are sampled within the elliptical regions of the hypothesized states x_k, we have found that in practice it is sufficient to sample the local-motion feature (4.1) within rectangular regions superimposed over the ellipses. The combined color/local-motion-based probabilistic tracker can be derived from the purely color-based tracker (Algorithm 3.1) by using the likelihood function defined in (4.7, 4.8) and the local-motion adaptation scheme from Section 4.2.2. The combined probabilistic tracker is summarized in Algorithm 4.1.

4.4 Experiments

Several experiments were conducted using examples from surveillance, sports tracking and hand tracking (Figure 4.5) to compare the proposed combined tracker T_cmb from Algorithm 4.1 to the purely color-based tracker T_col proposed in Algorithm 3.1. Both trackers used 50 particles in the particle filter and a nearly-constant-velocity dynamic model. All recordings were taken at a framerate of 25 frames per second, except for the recording used for hand tracking, which was taken at 30 frames per second.

Initialize:
• Initialize the tracker by selecting the target (e.g., manually).
Tracking:
• For k = 1, 2, 3, ...
1. Execute an iteration of the color-based particle filter from Algorithm 3.1 using the likelihood function p(y_k|x_k) defined in (4.7) and the current reference local motion v_ref.
2. Estimate the current MMSE state x̂_k and the current velocity v̂_k using (2.32).
3. Estimate the new reference v_ref according to Section 4.2.2.
Algorithm 4.1: The combined color/local-motion-based probabilistic tracker.

The Shi-Tomasi feature detection from Section 4.1.1 was performed using 3 × 3 pixel neighborhoods, and only features whose smallest eigenvalue exceeded λ_th = 10^{-3} were accepted. The size of the integration window in the Lucas-Kanade optical flow calculation was set to 9 × 9 pixels.
The amplitude threshold used in (4.2) and (4.3) was set to δ_th = 10^{-2} pixels. In the experiment with hand tracking, a two-level pyramid was used to calculate the optical flow. The pyramids were not used in the other experiments. The parameters of the local-motion likelihood function (4.4) were set experimentally to λ_φ = 0.1, λ_r = 0.3 and w_noise = 0.01. Note that, since λ_r was chosen to be larger than λ_φ, the amplitude of the local motion had a smaller impact on the value of the likelihood function than the angle. The reasoning behind this is that during accelerated movement, typical for hands and people, the amplitude of the optical flow changes more significantly than its direction. Note that, with the exception of the number of pyramid levels, all parameters of the combined visual model were kept fixed for all experiments. We summarize these parameters in Table 4.1.

Figure 4.5: Images from the recordings used in the experiments with surveillance (a, b), sports tracking (c) and hand tracking (d).

Table 4.1: Parameters of the combined-visual-model probabilistic tracker.
Shi-Tomasi features (Section 4.1.1)
• Feature size: 3 × 3 pixels
• Threshold on smallest eigenvalue: λ_th = 10^{-3}
Lucas-Kanade optical flow (Section 4.1.1)
• Integration window size: 9 × 9 pixels
Local-motion likelihood function (Section 4.2.1)
• Minimal amplitude of local motion (4.2, 4.3): δ_th = 10^{-2} pixels
• Likelihood parameters (4.4): λ_φ = 0.1, λ_r = 0.3, w_noise = 10^{-3}

Experiment 1: Tracking in dynamic clutter

The aim of the first experiment was to demonstrate how the proposed combined visual model can resolve situations in which a person moves in dynamic clutter. A recording from the PETS 2006 database [127] (Figure 4.5a) was chosen, which contained a person dressed in black walking in front of a group of persons also dressed in black. This group constituted the so-called dynamic clutter, since this clutter did not come from a static background.
The size of the person in the video was approximately 13 × 30 pixels. The person was manually selected and tracked with T_col and T_cmb. The results of tracking with T_col are shown in Figure 4.6a. Even though the person was walking in front of the group, the purely color-based tracker T_col could not discern the person from the group when they came into contact, due to their visual similarity, and tracking failed. Note that it is difficult even for a human observer to discern the tracked person from the others by looking solely at, for example, the 155th frame in Figure 4.6a. On the other hand, the proposed combined visual model in T_cmb was able to make use of the optical-flow information and successfully tracked the person throughout the contact (Figure 4.6b).

Figure 4.6: Images from the recording used in the experiment of tracking in dynamic clutter. The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

Experiment 2: Tracking a person through occlusion

In the second experiment we reconsidered the surveillance scenario (Figure 4.5b) which was used in the previous chapter to evaluate the performance of the color-based probabilistic tracker; see Section 3.6. In that scenario, a person was tracked as he moved from the left part of the scene to the right, and back to the left. While he was walking back to the left, he was briefly occluded by another, visually similar person.
Recall that the tracker which used the color-based visual model proposed in Chapter 3 was able to track that person even when the person was located against a cluttered background. However, when the person was occluded by another person, the tracker failed since the measurements became too ambiguous. The results of tracking with the proposed combined tracker T_cmb and the purely color-based tracker T_col are shown in Figure 4.7. Figure 4.7a shows that T_col fails during the occlusion, which is due to the significant visual ambiguity of the person's position. On the other hand, T_cmb is able to make use of the local motion and does not fail (Figure 4.7b). In particular, the local motion indicates that the apparent motion of the tracked target should point to the left. Since the tracked person and the occluding person are moving in opposite directions, this helps resolve the visual ambiguity and allows the tracker to maintain a lock on the correct target.

Experiment 3: Tracking a person through multiple occlusions

To demonstrate how the proposed tracker with the combined model performs in the presence of multiple occlusions between visually similar persons who rapidly change their direction of movement, we considered an example of tracking a player in a squash match (Figure 4.5c). The tracked player was approximately 25 × 45 pixels large and was occluded 14 times by another, visually similar player. The player was tracked five times, and the average number of tracking failures was recorded. The purely color-based tracker T_col failed on average twelve times, while T_cmb failed on average three times. Figure 4.8a shows five frames from the recording where, after the occluded player appears (t = 418), the visual information becomes ambiguous, since both players wear white shirts, and T_col fails (t = 425).
On the other hand, T_cmb successfully utilizes the local-motion information to resolve this ambiguity and tracks the correct player even after the occlusion (Figure 4.8b).

Figure 4.7: Images from the recording used in the experiment of tracking through an occlusion. The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

Experiment 4: Tracking a person's palms

In the fourth experiment we considered a recording of a person waving his hands (Figure 4.5d) in front of the camera. The hands were approximately 20 × 20 pixels large and were tracked independently of each other. They occluded each other 17 times, with the majority of occlusions occurring in front of the person's face. The reference tracker T_col failed 26 times, by either following the wrong hand or the face after the occlusion. The combined tracker T_cmb resolved the majority of occlusions and failed only four times, by losing the hand and locking onto the person's face. A detailed inspection of the results showed that, in the situations where T_cmb failed, the target's color model strongly favored the face, while the local-motion feature at the edge of the tracked hand still supported the target's reference motion. The reference motion model deteriorated, which caused the tracker to drift from the hand to the face. Figure 4.9 shows an example where T_col lost the hand after it was occluded by the other hand (Figure 4.9a), while T_cmb resolved the occlusion (Figure 4.9b).
Figure 4.8: Images from the recording used in the experiment of tracking a person through multiple occlusions. The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

4.5 Conclusion

In this chapter we have proposed a combined visual model for probabilistic tracking which is composed of two visual models of the tracked object. The first is the color-based model which we proposed in Chapter 3. Since this model alone cannot resolve situations in which the tracked object comes into close proximity of another, visually similar object, we have proposed a novel visual model, which we call the local-motion model. The local-motion model was presented in Section 4.2, where we showed how it can be calculated from the optical flow (Section 4.1.1), which is estimated using the Lucas-Kanade algorithm. While the Lucas-Kanade algorithm is relatively fast, it gives poor estimates of the optical flow in regions which lack texture. For that reason, we use the Shi-Tomasi features to detect regions with enough texture and estimate the optical flow only in those regions. Thus the local motion is defined using only a sparse optical flow. To account for the errors in the optical-flow estimation and rapid changes in the target's motion, we derived a probabilistic model of the

Figure 4.9: Images from the recording used in the experiment of tracking a person's hands.
The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

local motion in Section 4.2.1. Since the local-motion model changes significantly during the target's movement, an adaptation scheme for the local-motion model was devised in Section 4.2.2. In Section 4.3 we showed how the proposed local-motion model can be probabilistically combined with the color-based model into the combined probabilistic visual model. We also proposed a particle-filter-based tracker which uses the combined model. Several experiments were carried out to demonstrate the improvement in tracking performance when using the combined model instead of the purely color-based model. The experiments included tracking persons in dynamic clutter, tracking through occlusions, and an example of tracking a person's hands through multiple occlusions. The experiments have clearly shown the superiority of the combined model over the purely color-based model by significantly reducing the number of failures during tracking. Since the proposed combined tracking scheme can help resolve ambiguities associated with multiple visually similar targets, it can be used as an extension to existing multiple-target tracking schemes, such as [141], to increase their robustness.
Note also that the local-motion-based feature (Section 4.2) is general enough to be used not only within the framework of particle filters, but also with non-stochastic methods, e.g., the recently proposed AdaBoost tracker [54], to resolve the ambiguities caused by the visual similarity between different objects.

And yet it moves.
Galileo Galilei (1564–1642)

Chapter 5
A two-stage dynamic model

In the previous two chapters we explored two different visual models of the target to improve tracking in the presence of poor visual information. In this chapter we will focus on another important part of the probabilistic tracker – the dynamic model, by which we describe the dynamics of the tracked target. If the dynamics are well modelled, then the tracker's performance can be improved in several ways. One improvement is more accurate estimates of the target's position and prediction; another is a smaller failure rate. Note that since particle filters are Monte Carlo methods, the accuracy of the target's estimates depends on the number of particles used in the filter. Larger numbers of particles allow denser coverage of the target's state space and usually result in more accurate tracking. However, using more particles means more evaluations of the likelihood model. Depending on the complexity of the target's visual model, these evaluations can be time-consuming and can significantly slow down the tracking. By choosing an appropriate dynamic model, the particles can be used more efficiently. This can be achieved by directing them into the regions of the state space which are more likely to contain the current state of the target. A smaller number of particles may therefore be used to achieve equal or even better accuracy than a poorer dynamic model would achieve with a larger number of particles.
In this sense, at a fixed accuracy, a proper dynamic model can effectively reduce the computation time required for a single tracking iteration, making tracking feasible for real-time applications. When tracking persons, one of the following two models is used in practice: the random-walk (RW) model or the nearly-constant-velocity (NCV) model. The simpler of the two, the RW model, assumes that the velocity can be modelled by a white-noise process and is appropriate when the target performs radical accelerations in different directions, for example, when a person suddenly changes the direction of its motion or suddenly stops. On the other hand, in cases when the target moves in a certain direction, the RW model usually performs poorly and the motion is much better described by the NCV model. Unlike the RW model, the NCV model assumes that the acceleration can be modelled by a white-noise process. While it is true that at times human motion can be best explained by either the RW or the NCV model, the motion is often somewhere in between the two. In this chapter we propose a two-stage dynamic model which is able to better model the dynamics of human motion. We call the model a "two-stage" dynamic model due to its particular structure, which is composed of two different models: a liberal and a conservative model. The liberal model allows larger perturbations in the target's dynamics and is able to account for motions in between RW and NCV. On the other hand, the conservative model assumes smaller perturbations and is used to further adapt the liberal model to the current dynamics of the tracked object. We also propose a two-stage probabilistic tracker based on the particle filter and apply it to tracking entire persons as well as a person's hands. The outline of this chapter is as follows. In Section 5.1 we develop the liberal dynamic model and analyze how the parameters of the model influence the model's structure.
The conservative model is proposed in Section 5.2, and in Section 5.3 we propose the two-stage dynamic model and the corresponding probabilistic tracker. In Section 5.4 the results of the experiments are reported, and we summarize the chapter in Section 5.5.

5.1 The liberal dynamic model

As noted, the RW model assumes that the target's velocity is a white-noise sequence and is thus temporally completely uncorrelated. On the other hand, the NCV model assumes that the velocity is temporally strongly correlated, since changes in velocity arise only due to the white noise of the acceleration. Thus the conceptual difference between the RW and NCV models is that they assume two extremal views on the temporal correlation of the velocity. With this rationale we can arrive at a more general model by simply treating the velocity as a correlated noise, but without deciding on the extent to which it is correlated. A convenient way to model the correlated noise is to use a Gauss-Markov process (GMP). The GMP has been previously used with some success (see, e.g., [146, 147, 176]) in modelling the acceleration of an airplane in flight, which allowed improved tracking during air maneuvers. In this section we show that by modelling the velocity with a Gauss-Markov process we arrive at a model of which RW and NCV are only special cases and which is able to account for more general dynamics; we will call this model the liberal model. We also provide an analysis of the parameters of the liberal model. We start by noting that changes in the position x(t) arise due to a non-zero velocity v(t) of the target, i.e., ẋ(t) = v(t). The velocity v(t) is then modelled as a non-zero-mean correlated noise

v(t) = ṽ(t) + v̄(t),   (5.1)

where ṽ(t) denotes a zero-mean correlated noise and v̄(t) is the current mean of the noise – the input velocity.
We model the correlated noise ṽ(t) as a Gauss-Markov process with the autocorrelation function R_ṽ(τ) = σ² e^{−β|τ|}, where σ² is the variance of the process noise and β is the correlation time constant. To derive the dynamic model of the process (5.1) in a form which we can use for tracking, we first have to find a stochastic differential equation (s.d.e.) of the process (5.1), governed by a white-noise process, and then find its discretized counterpart. To derive the s.d.e. of the process ṽ(t) in (5.1), with the correlation function R_ṽ(τ), we have to find a shaping filter (see, e.g., [26], page 137) with a system transfer function G(s)¹ which transforms a unity white noise u(t) into ṽ(t) (see Figure 5.1). In principle, the transfer function G(s) can easily be found from the spectral densities of the input u(t) and the output ṽ(t) of the system shown in Figure 5.1. Although the derivation of the shaping filter can usually be found in textbooks on the theory of estimation and control, we provide a summarized derivation here for completeness. We start by noting that, for a stationary process, the Wiener-Khinchine relation (see, e.g., [26], page 86) tells us that the spectral density S_ṽ(s) of the process ṽ(t) is given by the Fourier transform F[·] of the process autocorrelation function; therefore we have

S_ṽ(s) = F[R_ṽ(τ)].   (5.2)

¹We use s to denote the complex frequency jω, where ω has the usual meaning of the angular frequency.

Figure 5.1: A shaping filter G(s) that takes a unity white-noise signal u(t) and transforms (shapes) it into ṽ(t).

It can also be shown (see, e.g., [26], page 130) that the spectral density S_u(s) of the input u(t) and the spectral density S_ṽ(s) of the output ṽ(t) of the system in Figure 5.1 are related as

S_ṽ(s) = G(s)G(−s)S_u(s).   (5.3)

Since the spectral density of the unity white noise is unity, i.e., S_u(s) = 1, we have from (5.2) and (5.3)

S_ṽ(s) = G(s)G(−s) · 1 = F[R_ṽ(τ)],   (5.4)

which gives the system transfer function, i.e.
the shaping filter,

$$G(s) = \frac{\sqrt{2\beta\sigma^2}}{s + \beta}. \quad (5.5)$$

A stochastic differential equation that corresponds to the shaping filter (5.5) is exactly the s.d.e. which we seek and is given as

$$\dot{\tilde{v}}(t) = -\beta\tilde{v}(t) + \sqrt{q_c}\,u(t), \quad (5.6)$$

where $q_c = 2\beta\sigma^2$ is the spectral density of the equivalent white-noise process acting on $\tilde{v}(t)$ and where, as before, $u(t)$ denotes a unit-variance white-noise process. The continuous-time s.d.e. of (5.1) can now be derived by expressing $\tilde{v}(t)$ from (5.1) and plugging it into (5.6),

$$\dot{\tilde{v}}(t) = -\beta\left(v(t) - \bar{v}(t)\right) + \sqrt{q_c}\,u(t). \quad (5.7)$$

In order to arrive at a discretized form of the above model, we first note from (5.1) that $\tilde{v}(t) = v(t) - \bar{v}(t)$ and assume that the input velocity $\bar{v}(t)$ remains constant over a sampling interval. Thus we obtain

$$\dot{v}(t) = -\beta v(t) + \beta\bar{v}(t) + \sqrt{q_c}\,u(t). \quad (5.8)$$

Since $\dot{x}(t) = v(t)$, we can write the complete system s.d.e. in the matrix form

$$\dot{\mathbf{X}}(t) = \begin{bmatrix} 0 & 1 \\ 0 & -\beta \end{bmatrix}\mathbf{X}(t) + \begin{bmatrix} 0 \\ \beta \end{bmatrix}\bar{v}(t) + \begin{bmatrix} 0 \\ \sqrt{q_c} \end{bmatrix}u(t), \quad (5.9)$$

where we have defined $\mathbf{X}(t) = [x(t), v(t)]^T$. The model (5.9) is a linear model with time-invariant system matrices, which makes the discretization of this system a straightforward matter (see Appendix B). The discretized counterpart of the continuous-time liberal model (5.1) with discretized states $\mathbf{X}_k = [x_k, v_k]^T$ is

$$\mathbf{X}_k = \Phi\mathbf{X}_{k-1} + \Gamma\bar{v}_{k-1} + \mathbf{W}_k, \quad (5.10)$$

$$\Phi = \begin{bmatrix} 1 & \frac{1}{\beta}\left(1 - e^{-\Delta t\beta}\right) \\ 0 & e^{-\Delta t\beta} \end{bmatrix}, \qquad \Gamma = \begin{bmatrix} \Delta t - \frac{1}{\beta}\left(1 - e^{-\Delta t\beta}\right) \\ 1 - e^{-\Delta t\beta} \end{bmatrix},$$

where $\bar{v}_{k-1}$ is the input velocity for the current time-step $k$, $\Delta t$ is the time-step length, and $\mathbf{W}_k$ is a white-noise sequence with a covariance matrix

$$Q = q_c\begin{bmatrix} q_{11} & q_{12} \\ q_{12} & q_{22} \end{bmatrix}, \quad (5.11)$$

$$q_{11} = \frac{1}{2\beta^3}\left(2\Delta t\beta - 3 + 4e^{-\Delta t\beta} - e^{-2\Delta t\beta}\right), \qquad q_{12} = \frac{1}{2\beta^2}\left(1 + e^{-2\Delta t\beta} - 2e^{-\Delta t\beta}\right), \qquad q_{22} = \frac{1}{2\beta}\left(1 - e^{-2\Delta t\beta}\right).$$

Note that there are two parameters which can be set in the liberal model (5.10, 5.11): one is the correlation-time parameter $\beta$ and the other is the spectral density $q_c$ of the noise. In the following we first give an analysis of how the parameter $\beta$ influences the structure of the proposed liberal model.
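The closed-form matrices in (5.10) and (5.11) translate directly into code. The sketch below (the function name and the parameter values are our own, for illustration) constructs $\Phi$, $\Gamma$ and $Q$ for given $\beta$, $q_c$ and $\Delta t$; a prediction step of a sampled-state tracker then amounts to one matrix-vector product plus Gaussian noise:

```python
import numpy as np

def liberal_model(beta, qc, dt):
    """Discretized liberal-model matrices (5.10, 5.11): Phi, Gamma and noise covariance Q.
    A direct transcription of the closed-form expressions; the helper is ours, not thesis code."""
    e1 = np.exp(-dt * beta)
    e2 = np.exp(-2.0 * dt * beta)
    Phi = np.array([[1.0, (1.0 - e1) / beta],
                    [0.0, e1]])
    Gamma = np.array([dt - (1.0 - e1) / beta,
                      1.0 - e1])
    q11 = (2.0 * dt * beta - 3.0 + 4.0 * e1 - e2) / (2.0 * beta**3)
    q12 = (1.0 + e2 - 2.0 * e1) / (2.0 * beta**2)
    q22 = (1.0 - e2) / (2.0 * beta)
    Q = qc * np.array([[q11, q12], [q12, q22]])
    return Phi, Gamma, Q

Phi, Gamma, Q = liberal_model(beta=2.0, qc=1.0, dt=1.0)
print(Phi)
print(Gamma)
print(Q)
```

For example, a single stochastic prediction of a state $\mathbf{X}_{k-1}$ with input velocity $\bar{v}_{k-1}$ would then be `Phi @ X + Gamma * vbar + rng.multivariate_normal(np.zeros(2), Q)`.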
Then we propose a method for selecting the spectral density $q_c$ for a given class of objects.

5.1.1 Parameter $\beta$

In terms of the parameter $\beta$, the dynamic properties of the liberal model (5.10) can be considered as being in between a random-walk and a nearly-constant-velocity model²; this can be seen by limiting $\beta$ to zero or to infinity. In the case of $\beta \to 0$, the model takes the form of a pure NCV model with a state transition matrix $\Phi_{\beta\to 0}$ and an input matrix $\Gamma_{\beta\to 0}$,

$$\Phi_{\beta\to 0} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \qquad \Gamma_{\beta\to 0} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \quad (5.12)$$

On the other hand, at $\beta \to \infty$, the model takes the form of a RW model with the state transition matrix $\Phi_{\beta\to\infty}$ and the input matrix $\Gamma_{\beta\to\infty}$,

$$\Phi_{\beta\to\infty} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \Gamma_{\beta\to\infty} = \begin{bmatrix} \Delta t \\ 1 \end{bmatrix}. \quad (5.13)$$

Note that the values of $\Gamma_{\beta\to\infty}$ are nonzero, thus the input velocity has to be set to zero, $\bar{v}_{k-1} = 0$, to obtain the pure random-walk model. For a comparison of the system and input matrices (5.12, 5.13) with those of the NCV and RW models, see Appendix B.1 and Appendix B.2.

²Recall that the conceptual difference between the RW and the NCV model is in the way they treat the time-wise correlation of the target's velocity. For completeness, we give derivations of the RW and NCV models in Appendix B.1 and Appendix B.2, respectively.

We have seen thus far that the liberal dynamic model takes the structure of the RW and NCV models at the limiting values of $\beta$. But what happens when $\beta$ is set somewhere in between zero and infinity? To get a better understanding of that, it is beneficial to rewrite the model in the following way. Let $\mathbf{x}_k = [x_k, v_k]^T$ denote the state at the time-step $k$ with position $x_k$ and velocity $v_k$, and, similarly, let $\mathbf{x}_{k-1} = [x_{k-1}, v_{k-1}]^T$ denote the state at the previous time-step $k-1$.
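The two limits (5.12) and (5.13) can also be verified numerically from the closed-form entries of $\Phi$ and $\Gamma$ in (5.10). A minimal sketch, where the helper name and the stand-in values $10^{-6}$ and $10^{6}$ for the limits are our own:

```python
import numpy as np

def phi_gamma(beta, dt=1.0):
    # System matrix Phi and input matrix Gamma of the liberal model (5.10).
    e1 = np.exp(-dt * beta)
    Phi = np.array([[1.0, (1.0 - e1) / beta], [0.0, e1]])
    Gamma = np.array([dt - (1.0 - e1) / beta, 1.0 - e1])
    return Phi, Gamma

# beta -> 0 approaches the NCV structure (5.12): Phi = [[1, dt], [0, 1]], Gamma = [0, 0].
print(phi_gamma(1e-6))
# beta -> infinity approaches the RW structure (5.13): Phi = [[1, 0], [0, 0]], Gamma = [dt, 1].
print(phi_gamma(1e6))
```

As the text remarks, the $\beta \to \infty$ limit leaves $\Gamma$ nonzero, so $\bar{v}_{k-1} = 0$ must be enforced to recover a pure random walk.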
We also rewrite the elements of the system transition matrix $\Phi$ and the input matrix $\Gamma$ (5.10) in the following abbreviated form

$$\Phi = \begin{bmatrix} 1 & \phi_{1,2} \\ 0 & \phi_{2,2} \end{bmatrix}, \qquad \Gamma = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}.$$

Note from (5.10) that $\Phi$ and $\Gamma$ depend on the size of the time-step $\Delta t$, which is the time between one measurement and the next. Without loss of generality we can set the time-step to unity, $\Delta t = 1$. For completeness, let us also define the values of the noise terms at time-step $k$, acting on the position and the velocity, by $\mathbf{W}_k = [w_{xk}, w_{vk}]^T$. Now we can rewrite the liberal model (5.10) in terms of the state's components as

$$x_k = x_{k-1} + \phi_{1,2}v_{k-1} + \gamma_1\bar{v}_{k-1} + w_{xk}, \qquad v_k = \phi_{2,2}v_{k-1} + \gamma_2\bar{v}_{k-1} + w_{vk}. \quad (5.14)$$

This means that $\phi_{1,2}$ and $\gamma_1$ are the proportions in which the internal velocity $v_{k-1}$ and the input velocity $\bar{v}_{k-1}$ will be combined into the deterministic part of the velocity acting on the current position $x_k$³. Similarly, $\phi_{2,2}$ and $\gamma_2$ are the proportions in which the internal velocity $v_{k-1}$ and the input velocity $\bar{v}_{k-1}$ will be combined into the deterministic part of the velocity acting on the current velocity $v_k$.

³The nondeterministic part of the velocity acting on the current position $x_k$ is the white noise $w_{xk}$.

With $\Delta t$ fixed, the values of the mixing factors $\phi_{1,2}$, $\phi_{2,2}$, $\gamma_1$ and $\gamma_2$ depend solely on $\beta$. We show this dependence in Figure 5.2.

Figure 5.2: In (a), the values of $\phi_{1,2}$ are depicted by the dashed line and the values of $\gamma_1$ by the full line. In (b), the values of $\phi_{2,2}$ are depicted by the dashed line and the values of $\gamma_2$ by the full line. In both images, the upright dash-dotted line marks the values of $\phi_{1,2}$, $\phi_{2,2}$, $\gamma_1$ and $\gamma_2$ at $\beta = 2$; for convenience, these values are written out at the marked locations.

From Figure 5.2 we see that by increasing $\beta$ the influence of the input velocity $\bar{v}_{k-1}$ in (5.14) increases, and for a very large $\beta$ the internal velocity $v_{k-1}$ is completely disregarded by the dynamic model, as $\phi_{1,2}$ and $\phi_{2,2}$ of (5.14) tend to zero. On the other hand, $\gamma_1$ and $\gamma_2$ tend to zero for small values of $\beta$.
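With $\Delta t = 1$ the mixing factors of (5.14) take simple closed forms, and each pair sums to one ($\gamma_1 = 1 - \phi_{1,2}$ and $\gamma_2 = 1 - \phi_{2,2}$), so the deterministic velocity contributions are convex combinations of the internal and input velocities. A small sketch (the helper name is ours) computes the factors as functions of $\beta$ and should correspond to the values written out in Figure 5.2:

```python
import numpy as np

def mixing_factors(beta, dt=1.0):
    """Mixing factors of the component-wise liberal model (5.14)."""
    e1 = np.exp(-dt * beta)
    phi12 = (1.0 - e1) / beta    # weight of the internal velocity v_{k-1} on the position x_k
    gamma1 = dt - phi12          # weight of the input velocity vbar_{k-1} on the position x_k
    phi22 = e1                   # weight of the internal velocity on the velocity v_k
    gamma2 = 1.0 - e1            # weight of the input velocity on the velocity v_k
    return phi12, gamma1, phi22, gamma2

# At beta = 2 (the value used in the experiments) this gives approximately
# phi12 = 0.432, gamma1 = 0.568, phi22 = 0.135, gamma2 = 0.865.
print(mixing_factors(2.0))
```

The convex-combination view makes the role of $\beta$ explicit: large $\beta$ shifts all the weight to the input velocity, small $\beta$ to the internal one.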
This means that we can consider $\beta$ as a parameter that specifies an a-priori confidence in the input velocity $\bar{v}_{k-1}$ and the internal velocity $v_{k-1}$. If, for example, we know that $\bar{v}_{k-1}$ is very accurate, then $\beta$ should be set to a very large value; otherwise, a smaller $\beta$ should be used. The two-stage dynamic model which is presented in this chapter usually yields reasonable estimates of the input velocity $\bar{v}_{k-1}$ for a large class of targets. In practice we have observed that it is thus beneficial to let the input velocity $\bar{v}_{k-1}$ have a dominant effect over $v_{k-1}$ in estimating the current velocity $v_k$. However, if we want the liberal model to be able to account for a greater agility of the target, it is also beneficial to let the internal velocity $v_{k-1}$ have a greater effect on predicting the current position $x_k$. We have found that these requirements are sufficiently well met at $\beta \approx 2$, which is the value we use in all subsequent experiments. The values of $\phi_{1,2}$ and $\gamma_1$ at $\beta = 2$ are shown in Figure 5.2a, while the values of $\phi_{2,2}$ and $\gamma_2$ at $\beta = 2$ are shown in Figure 5.2b.

5.1.2 Selecting the spectral density

Another important parameter of the liberal model (5.10) is the spectral density $q_c$ of the process noise (5.11). Note that in many cases it is possible to obtain some general characteristics of the dynamics of the class of objects which we want to track. Specifically, the expected squared distance $\sigma_m^2$ that objects of a certain class travel between two time-steps is often available. Assuming that we have some estimate of $\sigma_m^2$, and that the time-step size $\Delta t$ and the parameter $\beta$ are known, we now derive a rule of thumb for selecting the spectral density $q_c$. To derive the rule of thumb, let us consider the following example. Assume that at time-step $k-1$ a target is located at the origin of the coordinate system, i.e., $x_{k-1} = 0$, and begins moving with a velocity $v_{k-1} \sim \mathcal{N}(0, q_{22}q_c)$, i.e., $\mathbf{X}_{k-1} = [0, v_{k-1}]^T$.
Assuming that the input velocity $\bar{v}_{k-1}$ in (5.10) is zero, the target's state after a single time-step is

$$\mathbf{X}_k = \Phi\mathbf{X}_{k-1} + \mathbf{W}_k. \quad (5.15)$$

The covariance of the position at time-step $k$ is

$$P = \langle\mathbf{X}_k\mathbf{X}_k^T\rangle = \Phi\langle\mathbf{X}_{k-1}\mathbf{X}_{k-1}^T\rangle\Phi^T + Q.$$

For each experiment the performance difference

$$\Delta^{(i)} = C^{(i)}_{\mathrm{TS}} - C^{(i)}_{\mathrm{REF}} \quad (5.32)$$

was calculated. The term $C^{(i)}_{\mathrm{TS}}$ was the cost value (5.31) of $T_{\mathrm{TS}}$, while $C^{(i)}_{\mathrm{REF}}$ was the cost value of the reference tracker $T_{\mathrm{REF}}$. In our case the null hypothesis $H_0$ was that $T_{\mathrm{TS}}$ is not superior to $T_{\mathrm{REF}}$. For each tracker we calculated the sample-performance-difference mean

$$\bar{\Delta} = \frac{1}{R}\sum_{i=1}^{R}\Delta^{(i)} \quad (5.33)$$

and its standard error

$$\sigma_{\bar{\Delta}} = \sqrt{\frac{1}{R}\sum_{i=1}^{R}\left(\Delta^{(i)} - \bar{\Delta}\right)^2}. \quad (5.34)$$

The null hypothesis was then tested against an alternative hypothesis $H_1$, that $T_{\mathrm{TS}}$ is superior to the reference tracker $T_{\mathrm{REF}}$, using the statistic $\lambda = \bar{\Delta}/\sigma_{\bar{\Delta}}$. Usually, the alternative hypothesis is accepted at a significance level $\alpha$ if $\lambda > \mu_\alpha$, where $\mu_\alpha$ represents the point on the standard Gaussian distribution corresponding to the upper-tail probability $\alpha$. In our experiments we used $\alpha = 0.05$, which is common practice in hypothesis testing.

The results of the hypothesis testing on position and prediction with respect to a different number of particles in the particle filter are shown in Table 5.2 and Table 5.3. Table 5.2 shows the results for testing the hypothesis that $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{RW}}$, while Table 5.3 shows the results for testing the hypothesis that $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{NCV}}$. The second and third column in Table 5.2 and Table 5.3 show the test statistic $\lambda$. In all cases the test statistic is greater than $\mu_{0.05} = 1.645$. From Table 5.2 we can thus accept the hypothesis that $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{RW}}$ in estimating the position and the prediction at the $\alpha = 0.05$ level. Similarly, from Table 5.3 we can also accept the hypothesis that the tracker $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{NCV}}$.

$$\Phi(\Delta t) = e^{F\Delta t} = I + F\Delta t + \frac{1}{2!}F^2\Delta t^2 + \frac{1}{3!}F^3\Delta t^3 + \cdots,$$

$$\Gamma u_{k-1} = \int_{t_{k-1}}^{t_k}\Phi(\tau)G\,u(\tau)\,d\tau, \qquad \mathbf{W}(t_{k-1}) = \int_{t_{k-1}}^{t_k}\Phi(\tau)L\,w(\tau)\,d\tau, \quad (B.3)$$

where $t_k = t_{k-1}$
$+\,\Delta t$, $\mathbf{W}(t_{k-1})$ is the driven response at $t_k$ due to the presence of the white-noise input during the $(t_{k-1}, t_k)$ interval, and is itself a white-noise sequence. From now on we will abbreviate notations of values at times $t_k$ by the subscripts $(\cdot)_k$. Thus (B.2) may be written as

$$\mathbf{x}_k = \Phi(\Delta t)\mathbf{x}_{k-1} + \Gamma u_{k-1} + \mathbf{W}_{k-1}. \quad (B.4)$$

To find the covariance matrix $Q_{k-1}$ of the white-noise sequence $\mathbf{W}_{k-1}$, we have to evaluate the following expectation

$$Q_{k-1} = \langle\mathbf{W}_{k-1}\mathbf{W}_{k-1}^T\rangle = \left\langle\left[\int_{t_{k-1}}^{t_k}\Phi(\xi)L\,w(\xi)\,d\xi\right]\left[\int_{t_{k-1}}^{t_k}\Phi(\eta)L\,w(\eta)\,d\eta\right]^T\right\rangle = \int_{t_{k-1}}^{t_k}\int_{t_{k-1}}^{t_k}\Phi(\xi)L\,\langle w(\xi)w^T(\eta)\rangle\,L^T\Phi(\eta)^T\,d\xi\,d\eta. \quad (B.5)$$

Since the continuous input disturbance $w(t)$ is assumed to be a white-noise random process with a zero mean and a spectral density matrix $Q_c$, we have

$$\langle w(\xi)w^T(\eta)\rangle = Q_c\,\delta(\xi - \eta), \quad (B.6)$$

where $\delta(\xi - \eta)$ is the Dirac delta function. Finally, using (B.6), and assuming a time-invariant system, we can rewrite (B.5) into

$$Q_{k-1} = \int_{0}^{\Delta t}\Phi(\xi)L\,Q_c\,L^T\Phi(\xi)^T\,d\xi. \quad (B.7)$$

The above equations describe the discretization process, which we will now apply to derive the random-walk and the nearly-constant-velocity models.

B.1 Random-walk dynamic model

The random-walk model assumes that changes in position arise purely due to a random factor. In other words, the system velocity is modelled by a white-noise sequence, i.e., $\dot{x}(t) = w(t)$. Thus the dynamics of a one-dimensional system governed by the random walk are described by the following continuous-time stochastic differential equation (s.d.e.)

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}\mathbf{x}(t) + \begin{bmatrix} 1 \\ 0 \end{bmatrix}w(t), \quad (B.8)$$

where the system state $\mathbf{x}(t)$ is composed of position and velocity, $\mathbf{x}(t) = [x(t), v(t)]^T$, and $w(t)$ is a one-dimensional continuous white-noise sequence with spectral density $q_c$. Following (B.1-B.7), the discretized s.d.e. is

$$\mathbf{x}_k = \Phi\mathbf{x}_{k-1} + \mathbf{W}_{k-1}, \qquad \Phi = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad (B.9)$$

where $\mathbf{W}_{k-1}$ is a white-noise sequence with covariance matrix (B.7)

$$Q = q_c\begin{bmatrix} \Delta t & 0 \\ 0 & 0 \end{bmatrix}. \quad (B.10)$$

Note that in (B.9) we write $\Phi$
instead of $\Phi(\Delta t)$ for brevity, and that we have dropped the subscript $(\cdot)_{k-1}$ in (B.10), since the covariance matrix $Q$ depends only on $\Delta t$, that is, the difference between the consecutive time-steps.

B.2 Nearly-constant-velocity dynamic model

The nearly-constant-velocity model assumes that while the changes in the position arise due to a nonzero velocity, the changes in the velocity arise purely due to a random factor. In other words, the system acceleration is modelled by a white-noise sequence, i.e., $\dot{v}(t) = w(t)$. Thus the dynamics of a one-dimensional system governed by the nearly-constant velocity are described by the following continuous-time stochastic differential equation

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\mathbf{x}(t) + \begin{bmatrix} 0 \\ 1 \end{bmatrix}w(t), \quad (B.11)$$

where the system state $\mathbf{x}(t)$ is composed of position and velocity, $\mathbf{x}(t) = [x(t), v(t)]^T$, and $w(t)$ is a one-dimensional continuous white-noise sequence with spectral density $q_c$. Following (B.1-B.7), the discretized s.d.e. is

$$\mathbf{x}_k = \Phi\mathbf{x}_{k-1} + \mathbf{W}_{k-1}, \qquad \Phi = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \quad (B.12)$$

where $\mathbf{W}_{k-1}$ is a white-noise sequence with covariance matrix (B.7)

$$Q = q_c\begin{bmatrix} \frac{1}{3}\Delta t^3 & \frac{1}{2}\Delta t^2 \\ \frac{1}{2}\Delta t^2 & \Delta t \end{bmatrix}. \quad (B.13)$$

It is interesting to see that the covariance matrix $Q$ implies correlation between the current perturbations $\mathbf{W}_{k-1}$ in position and velocity, even though $w(t)$ in (B.11) is a one-dimensional white-noise process acting only on the target velocity at the time instant $t$, and is thus not correlated with any other time instant. Note that the correlation in the discretized form (B.13) arises because we are integrating the effects of the white-noise process over a nonzero interval $\Delta t$ between consecutive time-steps.

Biography

Matej Kristan was born on October 30, 1978 in Ljubljana. He finished a four-year secondary school programme at gimnazija Ledina and enrolled in the Faculty of Electrical Engineering, University of Ljubljana, in 1997. Between 2000 and 2003 he worked as a student with the Laboratory of Modelling, Simulation and Control on a robot-soccer project.
The focus of his work was camera calibration and tracking. From 2001 to 2002 he was employed by Zamisel d.o.o., where he worked as a development engineer on several projects. The results of his work were methods for automatic online adjustment of camera parameters, methods for rapid merging of 3D triangulated objects, and automatic methods for calibration of laser CNC machines. In 2003 he received a diploma degree from the Faculty of Electrical Engineering, University of Ljubljana, in the field of automatics and cybernetics. His thesis was on a system for tracking in robot soccer. In 2003 he also started his postgraduate study at the same faculty and joined the Laboratory of Imaging Technologies, where he worked from 2003 to 2005 on a commercial project of tracking in sports. In 2005 he received a master's degree with a thesis on the application of sequential Monte Carlo methods to tracking in computer vision, and enrolled in the doctoral programme at the Faculty of Electrical Engineering, University of Ljubljana. In 2006 he was employed as a researcher at the Visual Cognitive Systems Laboratory, Faculty of Computer and Information Science, University of Ljubljana. In 2007 he was also employed as a researcher in the Machine Vision Group of prof. dr. Stanislav Kovačič at the Faculty of Electrical Engineering, University of Ljubljana. Between 2004 and 2008 he attended various international winter and summer schools on computer vision and statistical pattern recognition. Currently he is a researcher at the Visual Cognitive Systems Laboratory and the Machine Vision Group. Between 2003 and 2008 he authored or coauthored six scientific journal papers, twenty conference papers, a technical report, a book chapter, a diploma thesis and a master's thesis.
Personal data

First and last name: Matej Kristan
Date of birth: October 30, 1978
Birthplace: Ljubljana
Home address: Rožna Dolina c. VIII/29a, Ljubljana
Citizenship: Slovenian

Education

1997: Finished secondary school at gimnazija Ledina.
1997-2003: Study in a five-year university programme at the University of Ljubljana, Faculty of Electrical Engineering.
September 2003: Graduated at the Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Title of the thesis: Sledenje objektov v robotskem nogometu.
2003-2005: Postgraduate student, Faculty of Electrical Engineering, University of Ljubljana, Slovenia.
May 17-20, 2004: Advanced School on Computer Vision, Pattern Recognition and Image Processing, Verona, Italy. Invited speaker: Prof. Brendan J. Frey; course title: Bayesian Networks and Algorithms for Inference and Learning: Applications in Computer Vision, Audio Processing, and Molecular Biology.
September 2005: Received a master's degree, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Title of the thesis: Sekvenčne Monte Carlo metode za sledenje oseb v računalniškem vidu.
2005-2008: Doctoral study at the Faculty of Electrical Engineering, University of Ljubljana.
August 25-29, 2007: DIRAC/COSY Summer Workshop on Multi-Sensory Modalities in Cognitive Science, Gerzensee, Switzerland.
March 9-14, 2008: VISIONTRAIN Thematic School on Understanding Behavior from Video Sequences, Les Houches, France.

Professional and academic experience

2000-2003: Laboratory of Modelling, Simulation and Control, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: student. Work: spare-time work on a multi-agent management system in robot soccer; areas of focus: tracking and camera calibration.
2001-2002: Zamisel d.o.o., Ljubljana, Slovenia.
Position: development engineer. Work: development of methods for automatic calibration of laser CNC machines, methods for rapid merging of triangulated 3D objects, and automatic procedures for camera-parameter optimization for visual inspection.

2003-2005: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: development of a commercial application for motion analysis in team sports. Commissioned by: Inspireworks, NY, USA.

2005-2006: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: development of a laser system for measuring dimensions of facade plates. Commissioned by: Trimo d.o.o., Ljubljana.

2006-2008: Visual Cognitive Systems Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia. Position: researcher. Work: development of cognitive systems for a cognitive assistant (EU project CoSy).

2006-2007: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: M2-0156 project CIVaBIS, sponsored by the Ministry of Defence of the Republic of Slovenia.

2007-2008: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: technological project MIR, "Autonomous vessel for measurement and logistics" (APSIS), sponsored by the Ministry of Defence of the Republic of Slovenia.

2007-2008: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: M3-0233 project PDR, sponsored by the Ministry of Defence of the Republic of Slovenia.

Honors and awards

2005: Received a best student paper award at the International Symposium on Image and Signal Processing and Analysis ISPA 2005.

Published work

1.01 Original Scientific Articles

[1] M. Kristan, M. Perše, S. Kovačič, and J.
Perš, Closed-world tracking of multiple interacting targets for indoor-sports applications, Computer Vision and Image Understanding (2008), in press.

[2] M. Perše, M. Kristan, G. Vučkovič, S. Kovačič, and J. Perš, A trajectory-based analysis of coordinated team activity in a basketball game, Computer Vision and Image Understanding (2008), in press.

[3] M. Kristan, M. Perše, S. Kovačič, and J. Perš, Sledenje več igralcev v športnih igrah na podlagi vizualne informacije, Electrotechnical Review 74 (2007), no. 1-2, 19-24.

[4] M. Kristan, J. Perš, M. Perše, and S. Kovačič, A Bayes-spectral-entropy-based measure of camera focus using a discrete cosine transform, Pattern Recognition Letters 27 (2006), no. 13, 1419-1580.

[5] G. Klančar, M. Kristan, and R. Karba, Wide-angle camera distortions and non-uniform illumination in mobile robot tracking, Robotics and Autonomous Systems 46 (2004), no. 2, 125-133.

[6] G. Klančar, M. Kristan, and S. Kovačič, Robust and efficient vision system for group of cooperating mobile robots with application to soccer robots, ISA Transactions 43 (2004), no. 3, 329-342.

1.08 Published Conference Papers

[7] V. Sulic, J. Perš, M. Kristan, A. Jarc, M. Perše, and S. Kovačič, Izvedba algoritma računalniškega vida na omrežni kameri, Računalniška obdelava slik in njena uporaba v Sloveniji (P. Božidar, ed.), 2008, pp. 35-40.

[8] M. Kristan, D. Skočaj, and A. Leonardis, Incremental learning with Gaussian mixture models, Computer Vision Winter Workshop, 2008, pp. 25-32.

[9] M. Perše, M. Kristan, J. Perš, G. Vučkovič, and S. Kovačič, Temporal segmentation of group motion using Gaussian mixture models, Computer Vision Winter Workshop, 2008, pp. 47-54.

[10] D. Skočaj, M. Kristan, and A. Leonardis, Continuous learning of simple visual concepts using Incremental Kernel Density Estimation, International Conference on Computer Vision Theory and Applications, 2008, pp. 598-604.

[11] M. Kristan, J. Perš, A. Leonardis, and S.
Kovačič, A hierarchical dynamic model for tracking in sports, Proceedings of the Sixteenth Electrotechnical and Computer Science Conference ERK07, September 2007, pp. 187-190.

[12] D. Skočaj, A. Vrečko, M. Kristan, B. Ridge, G. Berginc, and A. Leonardis, Interaktiven sistem za kontinuirano učenje vizualnih konceptov, Proceedings of the Sixteenth Electrotechnical and Computer Science Conference ERK07, 2007, pp. 167-170.

[13] M. Perše, M. Kristan, J. Perš, and S. Kovačič, Automatic evaluation of organized basketball activity, Computer Vision Winter Workshop 2007 (M. Grabner and H. Grabner, eds.), 2007, pp. 11-18.

[14] J. Perš, M. Kristan, M. Kolbe, and S. Kovačič, Tailgating detection using histograms of optical flow, Computer Vision Winter Workshop 2007 (M. Grabner and H. Grabner, eds.), 2007, pp. 19-26.

[15] M. Kristan, J. Perš, A. Leonardis, and S. Kovačič, Probabilistic tracking using optical flow to resolve color ambiguities, Computer Vision Winter Workshop 2007 (M. Grabner and H. Grabner, eds.), 2007, pp. 3-10.

[16] F. Erčulj, G. Vučkovič, J. Perš, M. Perše, and M. Kristan, Razlike v opravljeni poti in povprečni hitrosti gibanja med različnimi tipi košarkarjev, Zbornik naučnih i stručnih radova (N. Smajlovič, ed.), 2007, pp. 175-179.

[17] M. Kristan, J. Perš, M. Perše, and S. Kovačič, Towards fast and efficient methods for tracking players in sports, Proceedings of the ECCV Workshop on Computer Vision Based Analysis in Sport Environments (J. Perš and D. R. Magee, eds.), May 2006, pp. 14-25.

[18] M. Perše, M. Kristan, J. Perš, and S. Kovačič, A template-based multi-player action recognition of the basketball game, Proceedings of the ECCV Workshop on Computer Vision Based Analysis in Sport Environments (J. Perš and D. R. Magee, eds.), May 2006, pp. 71-82.

[19] J. Perš, M. Kristan, M. Perše, M. Bon, G. Vučkovič, and S. Kovačič, Analiza gibanja igralcev med tekmami, Računalniška obdelava slik in njena uporaba v Sloveniji (P.
Božidar, ed.), 2006, pp. 97-103.

[20] M. Kristan, J. Perš, M. Perše, M. Bon, and S. Kovačič, Multiple interacting targets tracking with application to team sports, International Symposium on Image and Signal Processing and Analysis, September 2005, pp. 322-327.

[21] M. Perše, J. Perš, M. Kristan, G. Vučkovič, and S. Kovačič, Physics-based modelling of human motion using Kalman filter and collision avoidance algorithm, International Symposium on Image and Signal Processing and Analysis, September 2005, pp. 328-333.

[22] M. Kristan, J. Perš, M. Perše, and S. Kovačič, Bayes spectral entropy-based measure of camera focus, Computer Vision Winter Workshop, February 2005, pp. 155-164.

[23] M. Kristan, J. Perš, M. Perše, and S. Kovačič, Implementacija Condensation algoritma v domeni zaprtega sveta, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 179-182.

[24] M. Kristan and F. Pernuš, Entropy based measure of camera focus, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 179-182.

[25] J. Perš, M. Kristan, M. Perše, and S. Kovačič, Observing human motion using far-infrared (FLIR) camera - some preliminary studies, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 187-190.

[26] M. Perše, J. Perš, M. Kristan, and S. Kovačič, Vrednotenje učinkovitosti kalmanovega filtra pri sledenju ljudi, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 191-194.

Technical Reports

[27] M. Kristan, D. Skočaj, and A. Leonardis, Approximating distributions through mixtures of Gaussians, LUVSS-TR-04/07, Faculty of Computer and Information Science, University of Ljubljana, September 2007.

Monographs and Other Completed Works

2.01 Scientific Monographs

[28] G. Klančar, M. Lepetič, M. Kristan, and R.
Karba, Mobile robots: New research, ch. Vision System Design for Mobile Robot Tracking, pp. 117-141, Nova Science Publishers, 2006.

2.09 Master's Thesis

[29] M. Kristan, Sekvenčne Monte Carlo metode za sledenje oseb v računalniškem vidu, Master's thesis, Faculty of Electrical Engineering, University of Ljubljana, September 2005.

2.11 Diploma Thesis

[30] M. Kristan, Sledenje objektov v robotskem nogometu, Diploma thesis, Faculty of Electrical Engineering, University of Ljubljana, September 2003.

Declaration

I declare that I have written this doctoral thesis independently under the supervision of my mentor, prof. dr. Stanislav Kovačič, and co-mentor, prof. dr. Aleš Leonardis. The assistance of other colleagues is fully acknowledged in the acknowledgements.

Ljubljana, May 16, 2008

mag. Matej Kristan, univ. dipl. inž. el.