Univerza v Ljubljani
Fakulteta za elektrotehniko

Matej Kristan

Sledenje ljudi v video posnetkih s pomočjo verjetnostnih modelov

DOKTORSKA DISERTACIJA

Mentor: prof. dr. Stanislav Kovačič
Somentor: prof. dr. Aleš Leonardis

Ljubljana, 2008

University of Ljubljana
Faculty of Electrical Engineering

Matej Kristan

Tracking people in video data using probabilistic models

Ph.D. Thesis

Supervisor: prof. Stanislav Kovačič, Ph.D.
Co-supervisor: prof. Aleš Leonardis, Ph.D.

Ljubljana, 2008

Abstract

In this thesis we focus on probabilistic models for tracking persons in visual data. Tracking is defined within the context of probabilistic estimation, where the parameters of the target's model are considered random variables and the aim is to estimate, recursively in time, the posterior probability density function over these parameters. The recursive estimation is approached within the established Bayesian framework of particle filtering. Several aspects of tracking persons are considered in this thesis: how to build a reliable visual model of a person, how to efficiently model the person's dynamics, and how to devise a scheme to track multiple persons.

One of the essential parts of visual tracking is the visual model, which allows us to evaluate whether a person is located at a given position in the image. We propose a color-based visual model, which improves tracking in situations when the background color is similar to the color of the tracked person. The proposed color-based visual model applies a novel measure which incorporates the model of the background to determine whether the tracked target is positioned at a given location in the image. A probabilistic model of the novel measure is derived, which allows the use of the color-based visual model with the particle filter. The visual model does not require a very accurate model of the background, but merely a reasonable approximation of it.
To increase robustness to the color of the background, a mask function is automatically generated by the visual model to mask out the pixels that are likely to belong to the background. A novel adaptation scheme is applied to adapt the visual model to the current appearance of the target. The experiments show that the proposed visual model can significantly improve tracking in situations when the color of the tracked person is similar to the background, and can handle short-term occlusions between persons of different colors. However, tracking still fails when a person comes into close proximity of a visually similar object or is occluded by it. The reason is that the ambiguity in the visual information is too large and cannot be resolved even with a good dynamic model.

To better cope with the visual ambiguities associated with the color of the tracked person, we propose a combined visual model, which fuses the color information with the local motion in the person's appearance. The local-motion feature is calculated from a sparse estimate of the optical flow, which is evaluated only at image locations with sufficient texture. A probabilistic model of the local motion is derived which accounts for the errors in the optical-flow estimation as well as for rapid changes in the target's motion. The local-motion model is probabilistically combined with the color-based model into a combined visual model under the assumption that color is conditionally independent of motion. An approach is also developed to allow adaptation of the local-motion model to the target's motion.

To better describe the dynamics of a moving person and improve the estimation and prediction of the person's position, we propose a novel dynamic model, which we call the two-stage dynamic model, and the corresponding two-stage probabilistic tracker. The two-stage dynamic model is composed of a liberal and a conservative dynamic model.
The liberal model allows larger perturbations in the target's dynamics and is used within the particle filter to efficiently explore the state space of the target's parameters. This model is derived by modelling the target's velocity with a non-zero-mean Gauss-Markov process and can describe well motions ranging from a complete random walk to a nearly constant velocity. The conservative model imposes stronger restrictions on the target's velocity and is used to estimate the mean value of the Gauss-Markov process in the liberal model, as well as to regularize the state estimated by the particle filter. We give a detailed analysis of the parameters of the two-stage dynamic model, and also derive an approach to setting the spectral density of the liberal model.

The proposed solutions for tracking a single person are extended to tracking multiple persons. A context-based scheme for tracking multiple targets from a bird's-eye view is proposed, which simplifies the Bayes recursive filter for multiple targets and allows tracking with a lower computational complexity. When the scene is observed from a bird's-eye view, the recorded images can be partitioned into regions such that each region contains only a single target. This means that, given a known partitioning, the Bayes filter for tracking multiple targets can be simplified into multiple single-target trackers, each confined to the corresponding partition of the image. A parametric model of the partitions is developed, which requires specifying only the locations of the tracked targets. Since the partitions are not known prior to a given tracking iteration, a scheme is derived which iterates between estimating the targets' positions and refining the partitions. Using this scheme, we simultaneously estimate the locations of the targets in the image as well as the unknown partitioning.
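The partition model above requires specifying only the targets' locations. As an illustration only, a minimal sketch of such a location-driven partitioning, assuming a simple nearest-target (Voronoi) rule rather than the thesis's exact parametric form, might look like:

```python
import numpy as np

def assign_partition(point, target_positions):
    """Return the index of the region containing `point`, using a
    nearest-target (Voronoi) rule: each image location belongs to the
    partition of the closest tracked target."""
    diffs = np.asarray(target_positions, dtype=float) - np.asarray(point, dtype=float)
    distances = np.linalg.norm(diffs, axis=1)
    return int(np.argmin(distances))
```

Each single-target tracker would then evaluate its likelihood only inside its own partition; in the scheme proposed in the thesis, the partitions themselves are re-estimated jointly with the targets' positions.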
Key words: Computer vision; Probabilistic models; Tracking persons; Video data; Bayes recursive filter; Particle filters; Color-based model; Local motion; Two-stage dynamic model; Multiple targets

Acknowledgements

First of all I would like to express my sincere thanks to my supervisor, prof. dr. Stanislav Kovačič, and my co-supervisor, prof. dr. Aleš Leonardis, who have guided me during my studies and have provided a vibrant environment for my research. A big thank you to Janez Perš for his guidance in the early stages of my postgraduate studies and for discussions which have broadened my horizons in the field of computer vision. To Igor, Tanja and Katja, thanks for the encouragement, belief, and everything else that comes with a warm, supportive family. Thank you, Urša, for sticking with me from my diploma thesis, through my master's, right up to the doctoral thesis. You are the main reason why the Slovenian parts of those theses appear to be written in proper Slovene and with proper punctuation. I also thank my colleagues at the Machine Vision Group, the Visual Cognitive Systems Group, and the Laboratory of Metrology and Quality, many of whom have offered interesting discussions and comments on my professional work and theirs. I especially thank Miha Hiti and Barry Ridge for proofreading parts of my thesis. Thanks also to all my friends who have been great and relaxing company, and who are too numerous to be listed here – you know who you are. I do not, however, thank MAXDATA, who installed a shady disk in my laptop which almost caused me to lose my thesis – and I thank Norton Ghost for never having to think about that dreadful event again. And I thank Igor for helping me out the last time the damn laptop crashed.

Matej Kristan
Ljubljana, May 2008

Contents

Povzetek xi
  Description of the research area xii
  Visual features xiv
  Dynamic models xviii
  Methods for maintaining the identities of multiple targets xix
  Original scientific contributions xxi
  Detailed overview of the thesis xxii
1 Introduction 1
  1.1 Related work 3
    1.1.1 Visual cues 3
    1.1.2 Dynamic models 9
    1.1.3 Managing multiple targets 11
  1.2 Contributions 12
  1.3 Thesis outline and summary 14
2 Recursive Bayesian filtering 17
  2.1 Tracking as stochastic estimation 18
  2.2 Recursive solution 19
  2.3 Bayes filter for a stochastic dynamic system 21
  2.4 Historical approaches to recursive filtering 23
  2.5 Monte-Carlo-based recursive filtering 26
    2.5.1 Perfect Monte Carlo sampling 26
    2.5.2 Importance sampling 28
    2.5.3 Sequential importance sampling 29
    2.5.4 Degeneracy of the SIS algorithm 31
    2.5.5 Particle filters 34
3 Color-based tracking 41
  3.1 Color histograms 42
    3.1.1 Color-based measure of presence 43
  3.2 The likelihood function 45
  3.3 The background mask function 47
    3.3.1 Implementation of dynamic threshold estimation 48
  3.4 Adaptation of the visual model 49
  3.5 Color-based probabilistic tracker 52
  3.6 Experiments 54
  3.7 Conclusion 58
4 Combined visual model 61
  4.1 Optical flow 62
    4.1.1 Calculating the sparse optical flow 65
  4.2 Optical-flow-based local-motion feature 68
    4.2.1 Local-motion likelihood 68
    4.2.2 Adaptation of the local-motion feature 69
  4.3 The combined probabilistic visual model 70
  4.4 Experiments 70
  4.5 Conclusion 76
5 A two-stage dynamic model 79
  5.1 The liberal dynamic model 80
    5.1.1 Parameter β 83
    5.1.2 Selecting the spectral density 86
  5.2 The conservative dynamic model 87
  5.3 A two-stage dynamic model 89
  5.4 Experimental study 92
    5.4.1 Experiment 1: Tracking entire persons 92
    5.4.2 Experiment 2: Tracking a person's hands 99
  5.5 Conclusion 101
6 Tracking multiple interacting targets 103
  6.1 Using the physical context 104
  6.2 Parametric model of partitions 105
  6.3 Context-based multiple target tracking 106
  6.4 Experimental study 108
    6.4.1 Description of the recordings 108
    6.4.2 Results 111
  6.5 Conclusion 116
7 Conclusion 119
  7.1 Summary of contributions 122
  7.2 Future work 123
References 125
Appendix A 141
  A.1 Selection of the likelihood function 142
Appendix B 145
  B.1 Random-walk dynamic model 147
  B.2 Nearly-constant-velocity dynamic model 147
Biography 149
Published work 153
Izjava 157

Povzetek

Sledenje ljudi v video posnetkih s pomočjo verjetnostnih modelov
(Tracking people in video data using probabilistic models)

In this thesis we deal with probabilistic models for tracking persons in video data. We treat the parameters of the person's model as random variables and define tracking in the context of statistical estimation. The problem posed in this way is then solved by recursively estimating, over time, the posterior probability density function over the values of the parameters.
We approach the recursive estimation with the established probabilistic Bayesian methods known as particle filters. When tracking with particle filters, the success of tracking strongly depends on three major components of the tracker. The first component is the probabilistic visual model, which localizes the target in the image by means of its visual properties. The second component is the probabilistic dynamic model, which describes the target's dynamics; this model determines how the parameters of the target's model evolve through time. The third component of the tracker is the method for maintaining the identities of multiple targets, which is particularly important when the targets collide. In this thesis we propose improvements of all three components of the tracker. In the following we first describe the research area, then state the original scientific contributions, and conclude with a detailed overview of the thesis with emphasis on the contributions.

Description of the research area

Tracking people in video data is part of the broader field of computer vision, which many researchers have addressed over the last twenty years. The result of this research is a large body of literature, surveys of which can be found in the works of authors such as Aggarwal and Cai [2], Gavrila [51], Gabriel et al. [49], Hu et al. [60] and Moeslund et al. [114, 115]. Tracking with computer-vision methods has found its place in many applications, among them:

• Video surveillance, where the aim is to track cars and people in order to detect unusual behavior.
• Video editing, where the aim is to overlay graphical content on moving objects (persons) in video.
• Analysis of sports games based on trajectories obtained by tracking the players during a match.
• Tracking laboratory animals such as insects and rodents, where the goal is to study natural multi-agent systems.
• Human-machine interfaces in intelligent environments that assist people in their everyday tasks.
• Cognitive systems that use tracking to learn about the dynamic properties of the observed objects.

The principal problem of tracking in video is the uncertainty associated with the visual information and the uncertainty in the dynamics of the tracked objects. A natural way to account for these uncertainties is to treat tracking in the context of statistical estimation of the target's state (e.g., its position) through time. More precisely, the knowledge about the target's current state is represented as a probability density function (pdf) over the target's state space. Tracking is then treated as the problem of recursively estimating the target's posterior pdf at each time step in view of the current measurements. Under the assumption that the target's dynamics and the measurement process can be described by linear Gaussian processes, the posterior pdf can be computed analytically with the well-known Kalman filter [76]. The assumptions made by the Kalman filter, however, are often too idealized for visual tracking, resulting in degraded performance or even frequent failures of the tracker. To address more realistic problems, many improvements have been proposed in the literature, but these were not able to model the completely arbitrary distributions that can arise in visual tracking. In the late nineties, Isard and Blake [64] presented a method called the Condensation algorithm for efficient computation of the target's posterior pdf, which did not rely on assumptions as restrictive as those of the Kalman filter. This method belongs to the broader class of sequential Monte Carlo methods, collectively known as particle filters [6, 43]. In contrast to the Kalman filter, particle filters do not assume a Gaussian posterior pdf of the target, but represent the distribution by a discrete set of samples (particles). A tracking iteration thus consists of two steps. In the first step, the motion of the particles is simulated through a proposal distribution.
In the second step, each particle is assigned a weight based on the dynamic model and the likelihood function. The proposal distribution can serve to inject auxiliary information that steers the particles toward regions of the state space with higher probability. Often, however, no auxiliary information is available, and in such cases the dynamic model itself can be used as the proposal distribution. The result is the bootstrap particle filter [53], which is also the most widely used among all variants of these filters. The efficiency of a particle filter depends primarily on the following parts:

• The visual features used to describe the visual properties of the target.
• The dynamic model, which describes the dynamics of the target's motion.
• The scheme for maintaining the identities of the targets when more than one target is tracked.

This thesis focuses on the three parts listed above, which we address in the context of probabilistic tracking of persons in video data. The main contributions concern probabilistic models of visual features, probabilistic dynamic models, and probabilistic schemes for maintaining identities when tracking multiple targets. In the remainder of this chapter we first review the literature of the research area to which the contributions of the thesis pertain.

Visual features

Visual features model the visual information in images and are used during tracking to localize the tracked objects. With respect to the type of visual information, these models can be divided into shape models, appearance models and motion-based models.

Shape models

One of the early approaches to shape modelling was based on flexible curves, or snakes [148], which iteratively fit themselves to the edges of an object. The main drawback of these methods was their sensitivity to noise in the data. Snakes therefore often failed, e.g., when objects occluded each other or when an object was located on a background with many edges.
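Returning to the bootstrap particle filter described above: one full iteration (simulate through the proposal, weight by the likelihood, resample) can be sketched for a one-dimensional target. The random-walk proposal, the Gaussian likelihood and all parameter values below are illustrative assumptions, not the models used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_pf_step(particles, weights, z, motion_std=1.0, obs_std=0.5):
    """One iteration of a bootstrap particle filter for a 1-D target."""
    # 1) Proposal = dynamic model: simulate particle motion (random walk).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # 2) Weight each particle by the likelihood of the measurement z
    #    (Gaussian observation model assumed here).
    weights = weights * np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
    weights = weights / weights.sum()
    # 3) Resample to counter weight degeneracy; weights become uniform.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

Repeating this step while feeding in measurements drives the particle cloud toward the region of the state space supported by the observations.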
When dealing with objects whose shapes do not differ substantially, point distribution models (PDM) [37] can be used. These models have been successfully applied to modelling the shapes of resistors [36] as well as pedestrians [50]. PDMs assume that a unique set of points can be selected along the outline of the modelled objects. From a large set of objects annotated in this way, a compact description of the object can be obtained through principal component analysis (PCA). The final model thus consists of a subspace of points spanned by a small number of dominant directions of variation. Active shape models [19], based on B-splines with control points placed uniformly along the object's outline, have been used to determine the expected shape of a pedestrian in a visual-surveillance application [12], as well as to track a leaf on a bush [19]. Blake et al. [19] constrained the shape of the active contour to object-specific shapes by specifying a probability density function over the shape space. One of the important parameters of the active contour, which in general depends on the object, is the number of control points of the B-splines. Too many points can make the model overly complex and unstable, while modelling with too few points can result in an oversimplified shape that is unsuitable for tracking. To avoid the influence of the contour parameterization, Malladi et al. [108] proposed the use of level sets. The essence of level-set functions is that the explicit modelling of a curve is replaced by the modelling of a higher-dimensional level-set function; the contour is then given by this function at a constant level. One advantage of level-set functions over active contours is their ability to model topological changes in the object's shape. An example of tracking people with level-set functions can be found in [38].
In cases where the tracked objects are described by only a small number of pixels, or change shape rapidly, the approaches described above are not suitable for their representation. Perš and Kovačič [125] proposed 14 binary, Walsh-function-like kernels for a robust description of the shape of a handball player moving around the court. They used the kernels to encode the player at one time step and to localize the same player in the next step. Needham [116] proposed five pre-trained multi-resolution kernels for describing football players, determined from a large set of pre-segmented binary images of players. Dimitrijević et al. [42] used special motion-capture equipment to obtain a large set of sequences of human shapes during walking. From these sequences they determined templates for detecting key poses of walking people, where the key pose was defined as the pose in which a person has both feet on the ground and the angle between the legs is largest. They improved the robustness of the detection by considering several consecutive templates. Dalal and Triggs [40] presented an approach in which the shape of people is described with histograms of oriented gradients (HOG). They first divided the image into small cells and built a 1D histogram of gradient orientations for each cell. These gradients served as features for representing the content of an arbitrary rectangular region, and a support vector machine (SVM) was used to determine whether a rectangular region contains a person. Lu and Little [102] used HOGs for tracking and action detection of hockey players. They described the appearance of each player with its own HOG and used a particle filter to generate new possible positions of the players in the next image. The tracked players were located in the new image by comparing the reference HOGs with those computed at the generated positions.
Zhao and Thorpe [174] proposed using gradients computed from silhouettes obtained from depth images, and used a neural network to verify whether a silhouette belongs to a human. One of the weaknesses of shape-based models is that they do not account for the color of objects. These models therefore cannot track objects in the presence of other objects of the same class, even if the latter are of different colors. Another problem of models that explicitly describe the object's shape lies in their construction: considerable attention must be paid to capturing as much as possible of the variability of the class of shapes to be tracked. Moreover, capturing the shapes requires specialized software and hardware.

Appearance models

Early approaches to appearance-based tracking were based on so-called color templates [62, 142]. Color templates represent the tracked object with a rectangular matrix of pixels and a masking function that determines which pixels belong to the object and which do not. The template is computed at a known position of the object in one image and is used to localize the object in the next image. Senior [141] used a somewhat more complex adaptive statistical appearance model for tracking in visual surveillance. This approach is also based on representing the tracked object with a rectangular matrix of elements, the difference being that the color of each element is modelled with a Gaussian distribution; the masking function is computed in a similar manner. Lim et al. [100] model the appearance of people with rectangular regions and at the same time model the dynamics of appearance changes. They achieve this by projecting the pixels within the region into a low-dimensional subspace through the nonlinear locally-linear-embedding algorithm, in which they then learn a dynamic model of the appearance of a walking person. Jepson et al.
[68] address the problem of changing appearance by modelling the appearance with three components: a slowly varying, a rapidly varying and a noise component. During tracking, all three components are continually adapted to the current changes with the expectation-maximization (EM) algorithm. Utsumi and Tetsutani [158] use a priori knowledge about appearance to detect people in images. They divide the input image into small cells and compare the variances and mean values of intensity between nearby cells. The detection of people rests on the assumption that these values vary little between neighboring cells in images that contain people, and more in images without them. In a sports-tracking application, Ok et al. [119] assume that a player can be compactly described with two colors: the color of the shirt and the color of the shorts. The authors therefore divide the player into two regions and describe each region with its average color. Wren et al. [169] presented the Pfinder system, which segments a person into a set of blobs and describes each blob with an ellipse and an average color. However, this system works only under fairly controlled conditions and when only a single person is present in the scene. A more robust approach is to use specialized detectors for individual body parts [113, 131]. With the help of the known topology of the body, these detections can then be used to build a statistical model for detecting people in images. The weakness of these approaches is that real environments contain many limb-like structures, which greatly increases the unreliability of the detection. A frequently used approach to modelling color appearance are color histograms [153], which have often been used in visual-tracking applications [60, 123, 118, 162, 35, 120, 34, 128]. Birchfield and Rangarajan [14] proposed a class of color histograms that also contains spatial information about color, which they achieve by recording the spatial information about the colors of the individual bins of the histogram.
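As an illustration of the histogram-based appearance models above, the sketch below builds a normalized joint RGB histogram and compares two histograms with the Bhattacharyya coefficient, a similarity commonly used with such models. The function names and the bin count are illustrative, not the exact choices of any cited work:

```python
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalized joint RGB histogram of an (N, 3) array of values in [0, 255]."""
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1.0)

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms; 1.0 if identical."""
    return float(np.sum(np.sqrt(p * q)))
```

In tracking, the coefficient between the reference histogram and a candidate-region histogram is typically converted into a likelihood, e.g. through exp(-λ(1 − coefficient)).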
Another rather popular approach to appearance modelling is the use of parametric models such as mixtures of Gaussians (MoG) [112, 78, 80, 172]. Recently, Tuzel et al. [156] presented a covariance-based description of the appearance model. In their approach, each pixel in the rectangular region describing the object is represented by a set of features, which can be intensities, gradients, etc. The appearance model is then built through the covariance matrix of the features computed over all pixels of the object. Objects are detected by comparing the covariance in a given region with a reference covariance, using a distance based on generalized eigenvalues. The visual models described so far are mainly based on the shapes, colors or intensity gradients of the tracked object in the image. Since these models directly encode information about pixel intensities, they cannot discriminate well between visually similar objects when these move close to each other or occlude each other. A different approach is therefore to use a feature that does not directly describe the intensity information; such a feature is, for example, the motion of pixels.

Motion-based models

Sidenbladh and Black [144] presented a method that considers the responses of various filters and learns the statistics of motion and appearance from a large set of examples of body parts; they use this method to determine the human pose. An example of tracking based entirely on optical flow was presented by Du and Piater [44]. The authors use the Kanade-Lucas-Tomasi point tracker [103] within a particle filter and identify the targets in each image by clustering similar optical flows. A similar approach was used by Pundlik and Birchfield [130], who use an affine-consistency criterion for clustering the optical-flow vectors. Recently, Brostow and Cipolla [25] proposed a method that uses optical flow to extract stable point trajectories in an image sequence.
The trajectories are then clustered with the minimum-description-length (MDL) method, resulting in independently moving bodies. All the methods above use clustering of the optical flow to determine the objects in the images. As a consequence, they cannot maintain the correct identities of the objects when these occlude each other, even if the objects are of different colors.

Dynamic models

While visual models describe the visual properties of the tracked objects, dynamic models describe their motion. Knowledge about the dynamics of an object's motion can greatly reduce the space of possible values of the object's state parameters that have to be estimated during tracking. This can help resolve ambiguities in the visual data, and can reduce the computational complexity of tracking. Mainly for these reasons, dynamic models are often used in estimating the human pose during motion. Sidenbladh et al. [145] use a strong a priori model of walking to estimate the possible directions of motion of the tracked person; the motion model is learned from a large database of labelled examples. Agarwal and Triggs [1] use a set of second-order models for tracking the articulated motion of people during walking and running. Urtasun et al. [157] use the scaled-Gaussian-process latent-variable method with embedded dynamics to learn a low-dimensional embedding of the space of poses of a golfer during a swing and of the space of shapes of a walking person. Some authors have proposed using interacting multiple models (IMM) to describe different types of object-motion dynamics. This approach is based on running several trackers simultaneously, where each tracker uses a different dynamic model to track the same object. Through a special procedure that determines how well each of the models describes the current motion of the object, the tracking results of the individual trackers are combined into a joint estimate of the target's state [10].
IMM methods based on the Kalman filter have mainly been used in radar-based tracking of airplanes [98, 9]; an example of an application to camera-gaze control can be found in [23]. Because of the limitations of the Kalman filter, some authors [111, 20] have used IMM methods in combination with particle filters. A drawback of IMM methods is that the probability space increases considerably in comparison with methods that use only a single dynamic model, since the probability density function has to be estimated over all the models, not just one. In particle filters, the value of the likelihood function has to be computed for each generated hypothesis (particle) separately. In visual-tracking applications this is usually a time-consuming operation, since a visual model has to be built for each particle and compared with the reference model. The computational cost of visual tracking with particle filters thus increases substantially when IMM methods are used. In many applications (e.g., tracking in sports, visual human-machine interfaces for gesture recognition, face tracking and visual surveillance) it is difficult to specify a compact set of rules that govern the target's dynamics. For this reason, and because of the computational complexity of IMM methods, most researchers use only a single model to describe the dynamics. The classical choice is the random-walk (RW) model or the nearly-constant-velocity (NCV) model; a good description of these models can be found in [136]. The RW model best describes the motion of a target that abruptly changes the direction of its motion or remains still. When the target moves approximately uniformly in some direction (which is typical of tracking applications in sports and surveillance), the RW model gives poor results and the motion is better described by the NCV model. Thus, in order to cover a wider spectrum of motion, researchers usually choose one model, RW or NCV, and increase its process noise.
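For reference, the standard discretized forms of these two models, as commonly given in the tracking literature (e.g. [136]), can be written down directly. The sketch below is a generic textbook formulation, with dt the sampling period and q the process-noise intensity:

```python
import numpy as np

def ncv_model(dt=1.0, q=1.0):
    """Nearly-constant-velocity (NCV) model over the state [position, velocity]."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])                 # state-transition matrix
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                      [dt**2 / 2, dt]])        # process-noise covariance
    return F, Q

def rw_model(dt=1.0, q=1.0):
    """Random-walk (RW) model: position only, driven by white noise."""
    F = np.array([[1.0]])
    Q = np.array([[q * dt]])
    return F, Q
```

Increasing q in either model broadens the spread of the predicted particles, which is exactly the "increase the process noise" remedy described above.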
However, if we wish to achieve a sufficiently dense coverage of the probability space, and thereby satisfactory tracking, the number of particles in the particle filter has to be increased. This increases the number of required evaluations of the likelihood function, which slows down tracking.

Methods for maintaining the identities of multiple targets

When several targets are tracked simultaneously, the non-trivial problem arises of maintaining the correct identity of each target. The classical approach in estimation and control theory is to detect all possible target candidates and associate the detections with the tracked targets. The standard approaches to solving the association problem are nearest-neighbor (NN) association and joint probabilistic data association (JPDA) [56]. Applications of NN and JPDA filters to tracking in sports can be found in [171, 66, 30] and [77], and earlier examples of the use of JPDA filters in computer vision in [132, 138]. All these approaches rely on explicit detection of possible targets and require exhaustive enumeration of all possible associations between targets and detections, which leads to an NP-complete problem. Some authors therefore try to reduce the number of possible associations at each step by considering only the detections closest to each target (gating) [171, 66, 30]. Hue et al. [61] treat the association vector as a random variable whose current value is determined by sampling with a Gibbs sampler. A different approach to tracking multiple targets is to treat the states of the individual targets as a single joint state. Such an approach allows the use of existing solutions in the context of particle filters [123, 116]. Isard et al. [65] proposed extending the joint state with an additional random variable representing the number of observed targets, and demonstrated the procedure on tracking a varying number of targets. This approach was used by Czyz et al. [39] for tracking football players.
A drawback of the methods that use a joint state is that, as a rule, a poor estimate of even one of the targets corrupts the joint estimate of all the targets. The number of particles in the particle filter therefore has to be increased considerably, which slows down tracking significantly and makes such a tracker suitable for tracking only a small number of targets [81]. To address this problem, some authors [175, 81] have recently proposed more efficient schemes based on Markov Chain Monte Carlo (MCMC) methods. Vermaak et al. [162] formulated tracking of multiple visually similar targets as the problem of maintaining the modes of the posterior distribution over all the targets. This approach was later used by Okuma et al. [120] and Cai et al. [30] for tracking hockey players. When the number of targets is known, a simple solution is to track each target with its own tracker. Such an approach reduces the complexity of the problem, since estimating the state of one target does not require considering the states of all the others. However, this approach is rather naive, since it often happens that, after a collision or an occlusion between similar targets, several trackers end up tracking the same target and tracking fails [81]. To address this problem, various authors have proposed methods such as histogram back-projection [142], occlusion-alarm probabilities and template-based methods [34]. Nevertheless, these methods fail when the targets are visually similar and move close to each other.

Original scientific contributions

In this thesis we have developed probabilistic models for tracking people in video data. We have investigated various probabilistic models of the visual and dynamic properties of the targets, as well as approaches to tracking multiple targets, with the goal of proposing solutions that improve tracking without substantially increasing the processing time and thereby slowing tracking down.
The original scientific contributions are the following:

• We developed a color-based visual model which improves tracking when the color of the tracked object is similar to the color of the background. The proposed visual model uses a novel presence measure for detecting a person at a given position in the image, which takes the background model into account. We derived a novel probabilistic model of the presence measure, which allows its use within a particle filter. The visual model does not require a very accurate model of the background, but merely an approximate estimate of it. To increase robustness to the background color, the visual model generates a mask for discarding the pixels that are more likely to belong to the background. We also proposed a novel method for adapting the model to the current visual properties of the target.

• We proposed a combined visual model which fuses color information with local-motion features, thereby resolving occlusions between visually similar objects. The local-motion feature is computed from a sparse optical flow at points that contain enough texture. We derived a probabilistic model of local motion which accounts both for possible errors in the optical-flow estimate and for changes in the direction of the target's motion. We combined the local motion with the color model into a combined model under the assumption that the object's motion is conditionally independent of its color. We proposed an approach by which the local-motion model adapts to the target's motion during tracking.

• We proposed a two-stage dynamic model, which combines a liberal and a conservative model for more faithful modelling of the target's motion, together with a method for setting the parameters of the liberal model. The two-stage dynamic model is composed of two dynamic models: a liberal and a conservative one. The liberal model allows large changes in the target's dynamics and is used in the particle filter for efficient coverage of the state space of the target's parameters.
We derived the model by modelling the velocity with a Gauss-Markov process with a nonzero mean; the model is therefore able to describe well a range of different motions, from random walk all the way to nearly-constant velocity. The conservative model imposes stricter constraints on the target's velocity. Within the tracker, the conservative model estimates the mean value of the Gauss-Markov process of the liberal model and at the same time regularizes the estimate of the target's state obtained from the particle filter. We analyzed the parameters of the dynamic model and proposed a practical method for estimating the spectral density of the noise in the liberal model.

• We proposed a context-based method for tracking a larger number of targets with linear computational complexity. In the context of observing the scene from a bird's-eye view, the recorded image can be partitioned into regions such that each region contains only a single target. This means that, for a known partitioning, the multi-target Bayes filter simplifies into several single-target trackers, with each tracker restricted to its own region of the image. We proposed a parametric model of the regions which requires specifying only the positions of the tracked objects. Since the partitioning is not known prior to a tracking iteration, we developed a method that iterates between estimating the positions of the targets and refining the estimate of the partitioning. With this method we simultaneously estimate the positions of the targets in the image as well as the unknown partitioning of the image.

In the following we give a more detailed overview of the contents of the thesis, with emphasis on the contributions to science.

Detailed overview of the contents

In Chapter 2 we described in detail the probabilistic framework of particle filters, within which we addressed the problem of tracking. We first posed tracking as a problem of stochastic estimation and then presented the well-known conceptual solution, which is obtained by applying the Bayesian recursive filter.
After a short review of historical approaches to recursive filtering, we showed how the recursion can be solved with Monte Carlo methods, the result being particle filters.

Chapter 3 is devoted to the development of a color-based visual model of the target, which is one of the principal parts of a tracker, since it allows estimating whether the target is located at a given position in the image. We derived the color-based visual model from color histograms and proposed improvements related to tracking with color information. The first improvement was a color-based presence measure. The proposed presence measure uses an estimate of the background image to reduce the influence of background noise¹. Using model selection and the maximum-likelihood method, we derived a likelihood function that allows a probabilistic interpretation of the values of the similarity measure, which enables integration into the probabilistic tracking framework. A problem arises when the target moves across a background of similar color, since in such extreme cases the presence measure does not discriminate well enough between the background and the tracked object. The visual model therefore attempts to estimate a mask for discarding the pixels that do not belong to the target. When the illumination of the scene changes, or when the camera shakes, it is usually difficult to obtain an accurate model of the background. We therefore focused on using only a simple model and proposed a procedure for dynamic background subtraction. In our approach the mask is generated indirectly, through an estimate of the similarity between the tracked object and the background, and is in this sense individualized to the tracked object. A further improvement is a method for selective adaptation of the visual model, which prevents adaptation in cases when the tracked object is occluded or when the estimate of its position in the image is wrong. We proposed an approach for probabilistically combining all these improvements into a tracker based on a particle filter.
The experimental results showed that the proposed solutions greatly improve tracking in cases when the tracked object is similar to the background and when short-term occlusions between visually similar objects occur. However, the experiments also showed that tracking still fails when the tracked object approaches, or is occluded by, an object of similar color.

In Chapter 4 we propose extending the color model with an additional model, which we call the local-motion model, into a new, combined visual model. The local-motion feature is computed from the optical flow, which we estimate with the Lucas-Kanade algorithm. While the Lucas-Kanade algorithm is relatively fast, it estimates the optical flow poorly at points where the image contains little texture. We therefore first use Shi-Tomasi features to determine regions with sufficient texture and compute the optical flow only at those points. The local-motion feature is thus determined using only a sparse representation of the optical flow in the image. To account for possible errors in the optical-flow estimate and for changes in the target's motion, we developed a probabilistic model of local motion. Since the local-motion model changes considerably as the target moves, we developed a method for adapting the model, which takes into account the estimated velocity of the tracked object. Using probabilistic approaches, we combined the local-motion model with the color model into a combined visual model of the target. We proposed a probabilistic tracker based on particle filters, which uses the combined visual model for tracking. The proposed tracker was tested on examples of tracking a palm and tracking people in surveillance and sports. The experimental results showed that the combined model successfully resolves occlusions between visually similar objects and enables improved tracking.

¹By the phrase "background noise" we refer to pixels that are similar in color to the pixels belonging to the tracked object.
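The core of such a sparse-flow computation — estimating the displacement at a single textured point by solving the 2×2 Lucas-Kanade normal equations — can be sketched as follows (a minimal illustration on a synthetic image pair, not the thesis implementation; the helper name `lucas_kanade_flow` is ours):

```python
import numpy as np

def lucas_kanade_flow(prev, curr, x, y, win=7):
    """Estimate the flow (u, v) at pixel (x, y) from the 2x2 LK normal equations."""
    h = win // 2
    p = prev[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    c = curr[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    Iy, Ix = np.gradient(p)              # spatial gradients of the previous frame
    It = c - p                           # temporal derivative between the frames
    # accumulate the structure tensor and the right-hand side over the window
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)         # least-squares flow (u, v)

# synthetic pair: the second frame is the first shifted by (u, v) = (0.4, 0.2)
xs, ys = np.meshgrid(np.arange(64, dtype=float), np.arange(64, dtype=float))
f = lambda X, Y: np.sin(0.3 * X) + np.cos(0.25 * Y)
prev_img = f(xs, ys)
curr_img = f(xs - 0.4, ys - 0.2)
u, v = lucas_kanade_flow(prev_img, curr_img, 32, 32)
```

In practice the window is placed only at points passing a Shi-Tomasi-style texture test, precisely because the matrix `A` becomes ill-conditioned in textureless regions.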
In Chapter 5 we focused on another very important component of a probabilistic tracker – the target's dynamic model. We proposed a two-stage dynamic model and a two-stage tracker, which can account for different types of motion dynamics. The two-stage model is composed of two dynamic models: a liberal and a conservative one. We derived the liberal dynamic model from the assumption that the object's velocity can be modelled by a Gauss-Markov process with a time-varying mean. An analysis of the parameters of the liberal model showed that two popular dynamic models, the random-walk (RW) model and the nearly-constant-velocity (NCV) model, are merely special cases of the liberal model, arising at the limiting values of its parameters. By choosing parameter values between these limits, the liberal dynamic model can describe well the dynamics that lie between RW and NCV. A very important parameter of the liberal dynamic model is the spectral density of the noise in the Gauss-Markov process, which depends on the dynamics characteristic of the class of tracked objects. We therefore proposed a method for the practical determination of the spectral density, which requires knowledge of only the general properties of the object's motion. The other important parameter of the liberal model is the mean value of the Gauss-Markov process, since it allows further adaptation of the tracker to the target's dynamics. To efficiently estimate this value during tracking, the two-stage tracker uses the conservative dynamic model. In contrast to the liberal model, the conservative model assumes that the object's current velocity is merely a linear combination of past velocities, and thus imposes stronger constraints on the object's velocity. The proposed two-stage dynamic model uses the liberal model within the particle filter for efficient exploration of the state space of the target's parameters.
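In a generic continuous-time form, such a velocity process can be written as a first-order mean-reverting Gauss-Markov model (a standard formulation given here only for illustration; the exact parametrization used in the thesis is developed in Chapter 5):

```latex
\dot{v}(t) = -\beta \big( v(t) - \bar{v}(t) \big) + w(t), \qquad
w(t) \sim \mathcal{N}(0,\, q),
```

where \(\bar{v}(t)\) is the (possibly time-varying) process mean, \(\beta\) controls how strongly the velocity is pulled toward that mean, and \(q\) is the spectral density of the driving noise. Loosely speaking, the limiting values of \(\beta\) recover random-walk-like and nearly-constant-velocity-like behavior, which is consistent with the analysis summarized above.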
On the other hand, the two-stage model uses the conservative dynamic model to estimate the mean value of the Gauss-Markov process in the liberal dynamic model and to regularize the estimates obtained from the particle filter. The experimental results showed that, in comparison with the two popular and commonly used dynamic models, the two-stage model achieves more accurate state estimates with a smaller number of particles in the particle filter. This considerably reduces the time required to process a single tracking iteration.

In Chapter 6 we extended the presented single-target solutions to tracking multiple targets. We focused on applications where the camera is positioned so that it observes the scene from a bird's-eye view, and proposed a new, context-based method for tracking multiple targets. In the context of observing the scene from a bird's-eye view, we derived constraints that simplify the multi-target tracking problem. These constraints dictate that the observed scene can be partitioned into a set of non-overlapping regions such that each region contains only a single target. We formalized the constraints with a parametric model of the image partitioning. In the Bayesian sense, the parametric model acts as a latent variable which, at a known value, simplifies the multi-target Bayes filter and allows each target to be tracked with its own tracker. This greatly reduces the computational complexity of the multi-target tracking problem. The two-stage dynamic model presented in Chapter 5 is used in the particle filter of each target, which makes the tracker even more efficient in terms of the time spent processing a single tracking iteration. The proposed context-based tracker was tested on a challenging database containing recordings of basketball and handball matches. The tracker was compared with a reference tracker, which differed from the proposed one only in that it did not use the image-partitioning model (the context) and was merely a set of independent single-target trackers.
In all the experiments the proposed tracker greatly reduced the number of tracking failures in comparison with the reference tracker, and enabled tracking even in cases when collisions and pushing occurred among several players.

The results and contributions of the thesis are summarized again in Chapter 7, where we emphasize the advantages and drawbacks of the proposed solutions. In light of these, we outline directions for further work and possible improvements of the methods for tracking people.

Although tracking itself is by and large a solved problem, ...

Jianbo Shi and Carlo Tomasi, 1994

Chapter 1

Introduction

Tracking people in video data is part of a broad domain of computer vision that has received a great deal of attention from researchers over the last twenty years. This gave rise to a body of literature, of which surveys can be found in the work of Aggarwal and Cai [2], Gavrila [51], Gabriel et al. [49], Hu et al. [60] and Moeslund et al. [114, 115]. Computer-vision-based tracking has found its place in many real-world applications; among these are:

• Visual surveillance, where the aim is to track people or traffic and detect unusual behaviors.

• Video editing, where the aim is to add graphic content over a moving object or a person in a video recording.

• Analysis of sport events to extract positional data of athletes during a part of the sports match. These data can then be used by sports experts to analyze the performance of the athletes.

• Tracking of laboratory animals such as insects and rodents, with the aim of studying the interactions of natural multi-agent systems.

• Human-computer interfaces used in intelligent ambients, which aim to assist people in their everyday tasks.

• Cognitive systems, which can use tracking to learn about the dynamic properties of different objects in their environment.

A prominent problem in tracking from video is posed by the inherent uncertainties associated with the visual data and with the dynamics of the tracked objects.
One way to account for these uncertainties is to consider the problem of tracking in the context of statistical estimation of the target's state (e.g., position) over time. More precisely, the information about the current state of the target is represented as a probability density function in the target's state space. Tracking is then posed as a problem of recursive estimation of the target's posterior distribution at each time-step in light of new measurements. Under the assumption that the target's dynamics and the measurement process can be described by linear Gaussian processes, the posterior can be calculated in closed form through the well-known Kalman filter [76]. However, the assumptions made by the Kalman filter are usually too unrealistic for visual tracking and thus result in degraded performance. Various extensions have been proposed over the years to account for more realistic models; however, none of them could deal with arbitrary forms of the target's posterior. In the late 1990s Isard and Blake [64] presented the Condensation algorithm, a method for efficiently calculating the posterior of the target that does not require the restrictions imposed by the Kalman filter. This method came from a general class of sequential Monte Carlo methods known as particle filters [6, 43]. In contrast to the Kalman filter, particle filters do not assume a Gaussian form of the target's posterior, but rather represent distributions by weighted sets of samples (particles). Each sample represents a realization of the target's state, and tracking then proceeds in two steps: simulating the samples using a proposal distribution, and recalculating their weights using the target's dynamic model and a likelihood function, which tells how likely each simulated state is, given the observation. The proposal distribution can serve as a means of using auxiliary information to guide particles into the more probable regions of the state space.
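The two steps above can be sketched as a minimal particle filter for a one-dimensional state, using the dynamic model itself as the proposal and a toy Gaussian likelihood (illustrative only; the names, noise levels and random-walk dynamics are our assumptions, not a model from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, measurement, q=0.5, r=1.0):
    """One iteration: propagate with the dynamic model (proposal), reweight, resample."""
    # 1) simulate each particle with the dynamic model (here: random walk, noise q)
    particles = particles + rng.normal(0.0, q, size=particles.shape)
    # 2) reweight by the likelihood of the measurement (here: Gaussian with std r)
    weights = weights * np.exp(-0.5 * ((measurement - particles) / r) ** 2)
    weights /= weights.sum()
    # 3) resample to counter weight degeneracy (simple multinomial resampling)
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# toy run: a target drifting to the right, observed through noisy position measurements
truth = np.cumsum(np.full(30, 0.3))
measurements = truth + rng.normal(0.0, 1.0, size=30)
particles = rng.normal(0.0, 2.0, size=500)
weights = np.full(500, 1.0 / 500)
for z in measurements:
    particles, weights = particle_filter_step(particles, weights, z)
estimate = np.average(particles, weights=weights)
```

The point estimate is taken here as the weighted mean of the particles; in visual tracking the expensive part is step 2, where each particle requires building and comparing a visual model.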
When no such information is available, a common approach is to use the dynamic model as the proposal, which gives the widely-used bootstrap particle filter [53]. The efficiency of visual tracking with particle filters depends a great deal on the following subparts of the method:

• Visual cues, which are used to encode the visual properties of the tracked objects.

• A dynamic model, which describes the dynamics of the tracked object.

• A multiple-target management system for keeping track of the identities of multiple objects in cases when multiple targets are considered.

The three subparts listed above will be the focus of this thesis. We will consider them in the context of probabilistic tracking of persons in video data. The main contributions will concern probabilistic visual models, probabilistic dynamic models and probabilistic schemes for tracking multiple targets. The remainder of this section is structured as follows. In Section 1.1 we review the related work on visual cues, dynamic models and probabilistic approaches to tracking multiple targets. In Section 1.2 we give a detailed description of our contributions, and in Section 1.3 we give the thesis outline.

1.1 Related work

1.1.1 Visual cues

The visual cues incorporate the visual information which is extracted from the images and is used to encode the visual properties of the tracked objects. Based on the type of the visual information contained in these models, we can divide them into three classes: shape-based models, appearance-based models and motion-based models.

Shape models

The early approaches to modelling shape used deformable lines, or snakes [148], which were iteratively fitted to the features corresponding to the edges of the object. The main disadvantage of these methods was their sensitivity to noise and their inability to handle situations where the object was occluded by another object.
Furthermore, those models were sensitive to the presence of spurious edges in the background. When we consider a class of objects with similar shapes, contour models such as point distribution models (PDM) [37] can be used. These have been successfully applied to modelling shapes of objects such as resistors [36] and have been demonstrated on an example of tracking pedestrians [50]. The PDMs are built from sets of examples of labelled points on the boundary of the object to be identified. A compact representation of the object is found through principal component analysis (PCA), by retaining a low-dimensional subspace spanned by the dominant modes of variation. Active shape models [19] based on B-splines with equally-spaced control points around the object's outline have been used to capture the expected shapes of pedestrians for visual surveillance in [12] and for tracking leaves of bushes [19]. The shape space of active contours can be efficiently constrained to a set of plausible shapes by building a probability density function (pdf) over the parameters of the contour [19]. To avoid a specific parametrization of the object's contour, level sets [108] have been proposed. Level sets are based on translating the explicit modelling of the curve into modelling a higher-dimensional embedding function. A constraint is imposed on this embedding function to yield regions inside and outside of the shape/contour. One advantage of level sets over active contours is that the embedding function can handle well topological changes in shape, such as splitting and merging. An example of using level sets for tracking silhouettes of humans in noisy images can be found in [38]. In cases when the tracked objects are small or change their shape rapidly, alternative shape features may be more appropriate. In an application of tracking in sports, Perš and Kovačič [125] encoded the players' shapes by utilizing 14 binary Walsh-function-like kernels.
The kernels were used to encode the shape of the target in the current time-step, and used in the next time-step to yield the most likely position of that target. To capture the variability in shape of football players, Needham [116] encoded the shapes of the players using a set of five pre-learned multi-resolution kernels, which were learned in a semi-supervised manner from hand-labelled binary images of the players. When the tracked objects occupy larger areas in the image, more detailed shape models can be applied. Dimitrijevic et al. [42] used motion-capture data to extract a large database of sequences of human shapes. These were used to detect key poses of walking humans in images by chamfer matching [121]. They defined the key pose as the pose when both feet of a person were on the ground and the angle between the legs was greatest. However, chamfer matching typically yields many false detections in real-life environments (e.g., [96, 42]). For that reason, the authors apply a temporal constraint by comparing three sequential frames with three sequential silhouettes in the template, and apply a statistical-relevance method to determine which parts of the silhouette are most significant for the task of detection. This methodology was extended in [48] to interpolate between detections and thus create trajectories of walking people. An implicit shape model was proposed by Leibe et al. [95] to detect pedestrians walking parallel to the image plane of the camera. Their approach uses a pre-learned codebook of patches extracted from pedestrians and applies a probabilistic Hough voting procedure. At the learning stage, patches are sampled from a set of pre-segmented images of pedestrians and a codebook of patches is generated. Then the extracted patches are revisited to create a spatial occurrence distribution for the codebook; at that stage the figure-ground map is also recorded for each patch.
In the recognition stage, candidate patches are extracted, matched to the codebook, and a spatial probability distribution of object locations is created. Detection is then carried out simply by detecting the modes in the location distribution. Dalal and Triggs [40] represented human shape by histograms of oriented gradients (HOG). They first divided an image into smaller cells, and for each cell a one-dimensional histogram of gradient directions was constructed. A support vector machine (SVM) was then used with these features to detect humans in rectangular regions in the image. Using a boosting approach, Zhou et al. [177] were able to speed up HOG-based detection to nearly real-time. Lu and Little [102] adopted HOGs to track and detect actions of hockey players. The key difference was that they used a separate reference HOG model of each player and used a particle filter to generate a set of hypothesized locations of the players in a given time-step. HOGs were extracted from the image at these hypothesized locations and then probabilistically compared to the reference HOGs to refine the hypotheses. Hotta [59] applied a bank of Gabor filters to detect edges in images and used the filtered images to detect and track faces. A face was encoded by dividing a predefined rectangular region into nonoverlapping blocks, and an SVM classifier was trained on each block separately using a database of presegmented faces. During the detection stage, the responses of these local classifiers were combined to classify the region as a face or a non-face. Zhao and Thorpe [174] calculated gradients from silhouettes of objects extracted from depth data. A neural network was then used on the calculated gradients to verify whether a given silhouette originated from a human. One drawback of the shape-based visual models is that they do not take into account the color properties of the target.
Thus these models can fail to maintain the identity of the object in the presence of multiple other objects of the same shape class. When constructing models that explicitly model the object's outline, great care must be taken to capture the variability of the entire class of objects we want to track. Furthermore, the construction of these models may require specialized hardware.

Appearance models

The early approaches in color-based tracking [62, 142] utilized color templates, which were extracted at the estimated position of the target in one frame and used to localize the same target in the next frame. A more elaborate adaptive statistical model of the object's appearance was used by Senior [141] in an application of visual surveillance. Each object was represented by a rectangular array of pixels, and the color distribution of each pixel was then modelled by a single Gaussian. Along with that, a mask function was estimated online to determine which pixels in the rectangle correspond to the object and which do not. Lim et al. [100] also encoded humans by regions within rectangles and modelled the dynamics of the changing appearance. This was achieved by projecting the pixels inside a rectangle to a low-dimensional subspace using a nonlinear local-linear-embedding algorithm. The dynamics of the appearance of a walking human were learned in this subspace. Jepson et al. [68] tackle the problem of appearance changes by modelling the appearance by three components: a slowly changing, a rapidly changing and a noise component. They use the expectation-maximization (EM) algorithm to update the components. Utsumi and Tetsutani [158] used prior knowledge of appearance to detect humans in images. They partitioned the image into a number of cells and compared variances and mean values of intensities among proximal cells.
Detection of humans was based on the assumption that, for images with humans, the distances among the cells will be smaller than for images without humans. In an application of sports tracking, Ok et al. [119] noted that a player can usually be described by two colors: the color of the shirt and the color of the shorts. Therefore, they divided each player into two separate regions and encoded each region by the mean value of the color within that region. Wren et al. [169] presented a system called Pfinder, which was based on segmenting a human into a set of blobs and encoding each blob by an ellipse and its color. This approach, however, works only when a single person is in the scene, and requires a controlled environment. A more robust approach is to apply body-part detectors to identify locations of the body parts, which can then be combined probabilistically to detect people [113, 131]. A drawback of this approach is that its performance can deteriorate in real-world images, since they usually contain many limb-like objects. An often-used approach to modelling color-based appearance is the application of color histograms [153]. Color histograms have been successfully applied in many applications of visual tracking [60, 123, 118, 162, 35, 120, 34, 128, 109, 7]. A common approach is to use a single histogram (e.g., [123, 118, 162]) to encode the object's appearance. Comaniciu and Meer [35] attempted to increase the robustness of tracking by also considering a histogram from a neighborhood of the tracked object to determine the salient components of the object's appearance model. A similar approach was adopted in [7] to determine color-salient regions on the object's appearance. Some attempts to explicitly include spatial information into histograms were presented, e.g., in [120, 109], where separate histograms were used to encode the upper and lower parts of a person's appearance.
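As an illustration of histogram-based appearance matching, the sketch below builds a normalized joint RGB histogram and compares two regions with the Bhattacharyya coefficient, a similarity measure commonly used with color-histogram trackers (a minimal example; the bin count and helper names are our choices, not a specific method from the works cited above):

```python
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalized joint RGB histogram of an (N, 3) array of pixel values in [0, 256)."""
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: 1 for identical histograms, 0 for disjoint ones."""
    return np.sum(np.sqrt(p * q))

rng = np.random.default_rng(1)
target_pixels = rng.integers(0, 256, size=(400, 3))   # pixels of the reference region
ref_hist = color_histogram(target_pixels)
other_hist = color_histogram(rng.integers(0, 128, size=(400, 3)))  # darker region

sim_self = bhattacharyya(ref_hist, ref_hist)    # identical regions -> coefficient 1
sim_other = bhattacharyya(ref_hist, other_hist) # different region -> lower score
```

A tracker typically converts such a similarity into a likelihood (e.g., through an exponential of the Bhattacharyya distance) so that it can weight particles.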
Birchfield and Rangarajan [14] proposed a class of color histograms that implicitly integrates the spatial information of the target's color. This is done by keeping track of spatial statistics for the colors of each bin in the color histogram. Another popular approach to modelling the appearance is using parametric models such as mixtures of Gaussians (MoG) to model the entire color distribution [112, 78, 80, 172] or to approximate only the dominant colors in the distribution [55]. Wang et al. [167, 168] extend MoGs by also considering spatial information and call these extended mixture models SMoGs. They also propose an EM-based algorithm to update SMoGs online. Recently, Tuzel et al. [156] introduced covariance-based descriptors of appearance. In their approach, each pixel in a rectangular region containing the object of interest is represented by a set of features. These features may be intensity values of color channels, gradients, etc. An appearance model is obtained by calculating the covariance matrix of the features over all pixels in the rectangle. This reference covariance is compared to the covariance in the candidate region by using a generalized eigenvalue-based distance measure. Babu et al. [7] proposed an appearance model for tracking nonrigid objects that can be considered a combination of a color-template- and a color-histogram-based approach. The model is constructed by selecting small neighborhoods of pixels within the object's bounding box. These neighborhoods are encoded by color templates as well as color histograms. During tracking, the color templates are used to obtain a rough estimate of the object's position in the current frame, and the histograms are then used to refine the position. Many of the visual models described so far are based on encoding some shape, color or gradient visual properties of the tracked objects.
When a target is moving in clutter, a single visual model may not be sufficient to discriminate the target from the background. For that reason several authors have proposed to track with combinations of these models. Li and Chaumette [97] combine shape, color, structure and edge information to improve tracking through varying lighting conditions and cluttered background. Similarly, Stenger et al. [152] and Wang et al. [168] combine color and edge features to make tracking robust to background clutter. Pérez et al. [124] propose to integrate sound cues with the visual cues to improve head tracking for specialized applications. Since all visual cues may not describe the target's appearance equally well, Brasnett et al. [24] proposed a weighted scheme to combine edge, color and texture cues. Even though fusing several visual models may improve tracking, these models are still intensity-related and are prone to fail in situations when the target is located in close proximity of another visually similar object. Thus, another approach is to utilize an appearance-independent cue such as the motion of pixels.

Motion-based models

Sidenbladh and Black [144] use filter responses to learn statistics of motion and appearance from a large number of training examples of different body parts for human pose estimation. Viola and Jones [163] improved pedestrian detection by learning a cascade of weak classifiers on manually extracted patches of differences between consecutive images. A probabilistic model of local differences in consecutive images was proposed by Pérez et al. [124]. They partition the image into an array of cells and assume that a cell contains motion if the differences in that cell are approximately uniformly distributed. A Parzen estimator [140] is then applied to produce a motion-based importance function, which is used within a particle filter to guide particles into the regions of the image which contain motion.
A drawback of methods which rely on image differencing is that they are essentially local-change detectors and therefore cannot resolve situations when a target is occluded by a moving, visually similar, object. An obvious solution is thus to take into account the apparent motion in the images – the optical flow. Various bottom-up approaches have been proposed recently, which are based on clustering similar flows to yield moving objects. Gonzalez et al. [52] applied a Kanade-Lucas-Tomasi (KLT) feature tracker [103] which used optical flow to track and cluster points on a human body. The robustness of the tracker was increased by applying a radial-basis-function network to filter the optical flow. Another attempt to track solely by the optical flow was presented by Du and Piater [44]. In their approach a KLT feature tracker was implemented in the context of a mixture particle filter. Targets were identified in each frame by clustering similar optical-flow features. A similar approach was used in [130], where the current flow vectors were clustered by region growing and pruning, using affine motion consistency as a criterion. Recently, an approach was presented in [25] where the optical flow was used to extract stable trajectories of features. At each time-step, the temporal trajectory of each active feature is considered for thirty frames forward and backward in time. These trajectories are first clustered into a large, predefined, number of clusters. A distance tree is then built among the clusters and a minimum-description-length method is applied to iteratively merge clusters into consistently moving objects. The same approach was adapted by [99], where the feature consistency criterion was formulated through potential functions among different flow trajectories. These potential functions considered motion coherence, spatial coherence as well as temporal inertia.
The features were then clustered hierarchically and heuristics were used to decide when to stop clustering. Bugeau and Pérez [28] introduce the color information in the clustering stage and apply graph cuts to improve segmentation of the object from the background. Assuming that discontinuities in the optical flow occur at the boundaries of a moving object, Lucena et al. [105, 104] were able to track a moving person’s palm using a contour tracker which was based on detecting these discontinuities. A drawback of the approaches which are based on clustering flow vectors is that, due to the clustering procedure and the nature of the optical-flow data, they cannot maintain correct identities of the targets after full occlusion, even if the targets are of different colors. Furthermore, those approaches that rely on the assumption that the target is always in motion are prone to failure when the target stops moving or moves significantly less than another visually similar object in its neighborhood.

1.1.2 Dynamic models

While the visual models are used to capture the visual properties of tracked objects, dynamic models are used to describe their dynamics, i.e., how the objects are expected to move in the image. When the dynamics of the tracked object are known, the search space of the parameters to be estimated during tracking can be constrained considerably. This helps resolve ambiguities in the visual data and can also reduce the processing time required for a single tracking iteration, since smaller portions of the parameter space need to be explored. In this respect, dynamic models have been extensively used in human pose estimation. Sidenbladh et al. [145] apply a strong prior of walking motion to determine the possible movement directions of a tracked person. The prior is learned using a large database of indexed examples.
Agarwal and Triggs [1] use a set of second-order dynamic models to track articulated motion of humans during walking and running. Urtasun et al. [157] use scaled Gaussian process latent variable models with incorporated dynamics to learn a low-dimensional embedding of the pose space for specific movements like golf swings and walking. In order to cover a range of possible dynamics of the tracked object, some authors have proposed an interacting multiple model (IMM) approach. In this approach multiple trackers, each with a different dynamic model, are used in parallel for tracking the target. A special scheme is used to determine how well each model describes the target’s current motion, and the estimates from the different trackers are then combined accordingly. A detailed treatment of different combination schemes is given in [10]. The interacting-multiple-model approaches based on Kalman filters have received considerable attention in the work on aircraft tracking with radars [98, 9], and an application to camera gaze control can be found in [23]. A particle-filter-based implementation of IMM can be found in [111, 20]. A drawback of IMM approaches is that the complexity of tracking increases dramatically, since now the probability distributions have to be estimated over each of the interacting models. In particle filters, the likelihood function of observations has to be evaluated for each hypothesis (particle). In visual tracking, calculating the likelihoods of particles is usually time-consuming, since the visual model has to be calculated for each particle and compared to the reference model. Thus the computational effort of visual tracking with particle filters increases considerably when using IMM approaches. For many applications, such as tracking in sports, gesture-based human-computer interfaces and surveillance, it is difficult to find a compact set of rules that govern the target’s dynamics.
Because of this, and the computational complexity associated with IMM methods, researchers usually model the target’s motion using a single model. The common choices are a random-walk (RW) model or a nearly-constant-velocity (NCV) dynamic model; see [136] for a good treatment of these. The RW model describes the target’s dynamics best when the target performs radical accelerations in different directions, e.g., when undergoing abrupt movements. However, when the target moves in a certain direction (which is often the case in sports and surveillance), the RW model performs poorly and the motion is better described by the NCV model. Thus, to cover a range of different motions, a common solution is to choose either a RW or a NCV model and increase the process noise in the dynamic model. However, to have a sufficiently dense coverage of the probability space, and therefore a satisfactory track, the number of particles in the particle filter also needs to be increased. This, in turn, introduces additional likelihood evaluations, which slows down the tracking.

1.1.3 Managing multiple targets

A non-trivial task when tracking multiple targets is maintaining the correct identities of the targets. In estimation theory, a classical approach to tracking multiple targets involves a detection step followed by target-to-measurement association. In addition to the Nearest Neighbor (NN) filter, techniques such as the Joint Probabilistic Data Association Filter (JPDAF) are common solutions to the association problem [56]. Applications of sports tracking based on the NN and JPDAF approaches can be found in [171, 66, 30] and [77], respectively. Some earlier applications of the JPDAF in the context of computer vision can be found in [132, 138]. The weakness of these approaches is that they involve an explicit detection and exhaustive enumeration of the associations, which leads to an NP-hard problem.
Some attempts to reduce the complexity of the association problem include gating [171, 66, 30] and treating the associations as random variables which can then be assigned via sampling [61]. Another way to tackle the problem of tracking multiple targets is to concatenate the states of all the targets into a single joint state. This makes it possible to apply particle-filtering techniques developed for single-target tracking [123, 116]. By introducing into the joint state an additional variable that indicates the number of targets, the authors of BraMBLe [65] were able to track a varying number of visually similar targets. This approach was adopted by Czyz et al. [39] to track soccer players of the same team. The weakness of the joint-state particle filters is that a poor estimate of a single target may degrade the entire estimation. For this reason, the number of particles needs to be increased, which may render the tracker computationally inefficient for more than three or four targets [81]. Recently, some efficient schemes based on Markov Chain Monte Carlo approaches have been proposed [175, 81] to solve this problem. Vermaak et al. [162] formulated the problem of tracking visually identical targets as the problem of maintaining the multi-modality of the estimated posterior distribution of the target states. The multi-modality of the posterior is maintained by a mixture particle filter. This approach was later applied by Okuma et al. [120] and Cai et al. [30] to track players in a hockey match. With a similar rationale, Chang et al. [32] apply a Parzen estimator [165] to the particle set and use a Mean Shift to detect the modes of the posterior. A simple solution when the number of targets is known is to track each target with a separate tracker.
This approach reduces the size of the state space and allows tracking of a specific target without the need to track all of the other targets as well, thus reducing the computational complexity of the tracker. However, this approach is rather naive, since the target with the highest score will often hijack the trackers of the nearby targets [81]. Solutions based on the histogram back-projection technique [142], the occlusion alarm probability principle [119] and template-based methods [34] have been proposed in the literature to cope with the problem of hijacking. However, when targets appear visually similar, these approaches still fail to maintain the correct identities after the targets come close to each other.

1.2 Contributions

In this thesis we deal with probabilistic models for tracking persons in video data. We explore various probabilistic models concerning the visual and dynamic properties of persons, as well as approaches to tracking multiple targets, with the goal of arriving at solutions that improve tracking performance while not significantly increasing the processing time. We propose several improvements in visual models, dynamic modelling and schemes for multiple-target tracking. The original contributions of the thesis are as follows:

• A color-based visual model for tracking persons is derived, which improves tracking in situations when the color of the tracked object is similar to the color of the background. The proposed color-based visual model uses a novel measure which incorporates the model of the background to determine whether the tracked target is positioned at a given location in the image. A probabilistic model of the novel measure was derived, which allows using the color-based visual model with the particle filter. The visual model does not require a very accurate model of the background, but merely a reasonable approximation of it.
To increase robustness to the color of the background, a mask function is automatically generated by the visual model to mask out the pixels that are likely to belong to the background. A novel adaptation scheme is applied to adapt the visual model to the current appearance of the target. • A combined visual model is proposed, which fuses the color information with the features of local motion, to resolve occlusions between visually similar objects. The local-motion feature is calculated from a sparse estimate of the optical flow, which is evaluated in images only at locations with enough texture. A probabilistic model of the local-motion is derived which accounts for the errors in the optical flow estimation as well as for the rapid changes in the target’s motion. The local-motion model is probabilistically combined with the color-based model into a combined visual model using an assumption that the color is conditionally independent of motion. An approach is also developed to allow adaptation of the local-motion model to the target’s motion. • A two-stage dynamic model is proposed, which combines the liberal and the conservative model to better describe the target’s motion, and a method for setting the parameters of the model is derived. The two-stage dynamic model is composed of two interconnected dynamic models: the liberal and conservative. The liberal model allows larger perturbations in the target’s dynamics and is used within the particle filter to efficiently explore the state space of the target’s parameters. This model is derived by modelling the target’s velocity with a non-zero-mean Gauss-Markov process and can explain well motions ranging from a complete random walk to a nearly-constant velocity. The conservative model imposes stronger restrictions on the target’s velocity and is used to estimate the mean value of the Gauss-Markov process in the liberal model, as well as for regularizing the estimated state from the particle filter. 
We have provided an analysis of the model’s parameters and proposed a rule of thumb for estimating the spectral density of the liberal model.

• A context-based scheme for tracking multiple targets is proposed, which allows tracking with a linear computational complexity. In the context of observing the scene from a bird’s-eye view, the recorded images can be partitioned into regions such that each region contains only a single target. This means that, given a known partitioning, the Bayes filter for tracking multiple targets can be simplified into multiple single-target trackers, each confined to the corresponding partition in the image. A parametric model of the partitions is developed, which requires specifying only the locations of the tracked targets. Since the partitions are not known prior to a given tracking iteration, a scheme is derived which iterates between estimating the targets’ positions and refining the partitions. Using this scheme we simultaneously estimate the locations of the targets in the image as well as the unknown partitioning.

1.3 Thesis outline and summary

The remainder of this thesis is structured into the following six chapters. In Chapter 2 we give a detailed description of the probabilistic framework which we use for tracking: the particle filter. We first pose tracking as a problem of stochastic estimation and show how the well-known conceptual solution emerges with the application of the recursive Bayes filter. After briefly reviewing historical approaches to recursive filtering, we show how particle filters emerge from the recursive Bayes filter when Monte Carlo methods are applied to solve the recursions. In Chapter 3 we present the color-based visual model, the novel measure of presence, its probabilistic model, the dynamic background subtraction scheme and the scheme by which the visual model is adapted to the target’s appearance.
We demonstrate in experiments how the proposed visual model improves the tracking performance in situations when the target moves over a cluttered background. We note that the purely color-based model cannot resolve ambiguities which arise in situations when the target undergoes an occlusion by a visually similar object. To cope with the drawbacks of the purely color-based visual model, we propose in Chapter 4 a novel local-motion-based model which is probabilistically combined with the color-based model into a combined visual model. We describe how the local-motion model is calculated from the optical flow. A probabilistic model of the local motion is derived, and a scheme to adapt the local-motion model to the current appearance of the target is presented. Experiments demonstrate how the proposed visual model can be used to resolve the visual ambiguities and improve tracking performance, with examples of tracking people in surveillance and sports and an example of tracking a person’s palms. Chapter 5 is dedicated to the problem of modelling the target’s dynamics. We propose a two-stage dynamic model and analyze how different values of the parameters influence the structure of this model. We also propose a two-stage probabilistic tracker and confirm the superiority of the two-stage dynamic model over two widely-used dynamic models, both quantitatively and qualitatively. In Chapter 6 we discuss how the context within which we observe a scene can be used to simplify the Bayes recursive filter for multiple targets. Based on this discussion, and using the two-stage dynamic model, we derive a novel multi-target tracker. The proposed tracker can track multiple targets with a very small number of particles in the particle filter, which effectively reduces the processing time required for a single tracking iteration. The proposed multiple-target tracking scheme is evaluated on a demanding data set from sports. In Chapter 7 we summarize this thesis.
We discuss the achieved results and provide an outlook for future avenues of research in the field of tracking persons from video data.

He who controls the present, controls the past. He who controls the past, controls the future.
George Orwell (1903 – 1950)

Chapter 2 Recursive Bayesian filtering

Over the past fifty years, Bayesian approaches called Bayes recursive filters have been shown to provide a solid theoretical ground for probabilistic tracking and have been successfully applied to tracking in visual data. The central point of these approaches is to represent all information about a target at one time-step with a posterior probability density function (pdf) over the target’s parameters, and to estimate this pdf recursively as time progresses. The popularity of the recursive Bayes filter comes from its ability to handle the various uncertainties associated with the target’s dynamics as well as with the sensors by which the target is perceived. One of the early approaches from the class of Bayes recursive filters is the well-known Kalman filter [76]. Although the Kalman filter has been applied with some success to various problems of tracking and estimation, it has a major drawback: it makes certain assumptions which are usually too restrictive for visual tracking. In the late nineties, filters based on Monte Carlo methods were proposed to solve the recursions in the Bayes filter. These approaches, known as particle filters, have gained considerable popularity in various areas of tracking and estimation and are the basis for the tracking algorithms which we propose in this thesis. Drawing heavily from the literature [101, 47, 43, 57, 6, 33], we provide in this chapter a derivation of the particle-filter framework and discuss some implementation issues. The outline of this chapter is as follows.
In Section 2.1 we pose tracking as a problem of stochastic estimation in dynamic systems, which can be treated within the Bayesian framework. In Section 2.2 we introduce the principle at the core of the recursive Bayes filter and in Section 2.3 we describe how different terms in the filter relate to different parts of the stochastic dynamic system. Some historical approaches to recursive filtering are briefly overviewed in Section 2.4. In Section 2.5 we finally show how the particle filters emerge when applying Monte Carlo methods to the Bayes filter.

2.1 Tracking as stochastic estimation

We can think about tracking as a process of obtaining some interesting information about a possibly moving target as time progresses. For example, in visual tracking, where an object is tracked in a sequence of images, the interesting information might be the position and size of the object in the image. To obtain such information, the target has to be modelled, and then this model is used to measure whether the target is located at a given position in the image. If certain values are assigned to the parameters of the model, we say that the target is in a certain state. The space of all parameter values is then called the state space. Formally, we denote the target’s state at time-step k by xk. Starting with an initial state x0, we denote the sequence of all states up to the current time-step k by x0:k = {x0, ..., xk}. The sequence x0:k is governed by a state evolution process, which is defined by the target’s dynamics. Since in general the evolution process is not fully deterministic, and because there is always some uncertainty associated with how well the true target dynamics are modelled by the evolution process, the state of the target at any time-step is usually regarded as a random variable.
Therefore, the state of the target at a given time-step is not described by a single value in the state space, but rather by a distribution of more likely and less likely values – a probability density function (pdf). The information about the state of the target is accessed through the measurement process, which for every state xk produces a measurement (an observation) yk. We denote the sequence of all observed measurements up to time-step k by y1:k = {y1, ..., yk}. Due to the uncertainty introduced by the imperfect target model and the inherent uncertainty of the measurement process, the measurements are also considered random variables. With the definitions above, we see that in tracking we wish to calculate the current state of the target, which is governed by a stochastic process. Furthermore, the information about the state is accessed through another stochastic process. Therefore, tracking can be considered a problem of stochastic estimation of the target’s state (model parameters) as time progresses. A methodology that is well designed to handle the uncertainties involved in the stochastic estimation is provided by Bayesian theory. From a Bayesian point of view, starting from a known prior distribution p(x0) over the initial state x0 of the target, all the interesting information about the sequence of target states x0:k is embodied by the posterior distribution p(x0:k|y1:k), which tells us how likely various state sequences x0:k are in light of the observed sequence of measurements y1:k. If the density p(x0:k|y1:k) is known, then estimates of functions of x0:k can be calculated by mathematical expectation. For example, the estimate which minimizes the mean-squared error is the minimum mean-squared error (MMSE) estimate

⟨x0:k⟩p(x0:k|y1:k) = ∫ x0:k p(x0:k|y1:k) dx0:k,

where ⟨·⟩ denotes the expectation operator.
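When the posterior is represented by a finite set of weighted samples (as in the particle filters derived later in this chapter), the MMSE estimate reduces to a weighted average, and a maximum a posteriori (MAP) estimate can be crudely approximated by the highest-weighted sample. A minimal sketch (the function names are ours, for illustration):

```python
import numpy as np

def mmse_estimate(particles, weights):
    """MMSE estimate: the expectation of the state under the
    weighted-sample approximation of the posterior."""
    return np.average(particles, axis=0, weights=weights)

def map_estimate(particles, weights):
    """Crude MAP estimate: the sample carrying the largest posterior weight."""
    return particles[np.argmax(weights)]
```

With particles {0, 1, 2} and weights {0.2, 0.3, 0.5}, the MMSE estimate is the weighted mean 1.3, while the MAP approximation picks the sample 2.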
Alternatively, the maximum a posteriori estimate MAP(x0:k) can be obtained by choosing the values of x0:k which maximize the posterior p(x0:k|y1:k), i.e.,

MAP(x0:k) = argmax over x0:k of [p(x0:k|y1:k)].

Note that during tracking we are usually interested only in the current state xk of the target and not the entire sequence of states x0:k. In Bayesian terms, this means that we only require the marginal distribution p(xk|y1:k) at time-step k. Furthermore, it is beneficial if the posterior can be calculated recursively using only the posterior from the previous time-step p(xk-1|y1:k-1) and the current observed measurement yk. This procedure is referred to in the literature as the Bayesian recursive filter and is derived next.

2.2 Recursive solution

To derive the recursive solution for calculating the marginal posterior p(xk|y1:k), we start from the complete posterior p(x0:k|y1:k) and apply the Bayes rule

p(x0:k|y1:k) = p(x0:k, y1:k) / p(y1:k). (2.1)

The numerator of (2.1) can be further factored using the chain rule into

p(x0:k, y1:k) = p(yk|y1:k-1, x0:k) p(x0:k|y1:k-1) p(y1:k-1), (2.2)

while the denominator of (2.1) is factored into

p(y1:k) = p(yk|y1:k-1) p(y1:k-1). (2.3)

Plugging (2.2) and (2.3) back into (2.1), factoring p(x0:k|y1:k-1) = p(xk|x0:k-1, y1:k-1) p(x0:k-1|y1:k-1), and cancelling the term p(y1:k-1), we arrive at

p(x0:k|y1:k) = p(yk|y1:k-1, x0:k) p(xk|x0:k-1, y1:k-1) p(x0:k-1|y1:k-1) / p(yk|y1:k-1), (2.4)

which is recursive in that p(x0:k|y1:k) is calculated from p(x0:k-1|y1:k-1). The posterior of the current state xk is obtained by marginalizing the complete posterior (2.4) over the sequence of past states x0:k-1,

p(xk|y1:k) = ∫ p(x0:k|y1:k) dx0:k-1,

which gives

p(xk|y1:k) = [∫ p(yk|y1:k-1, x0:k) p(xk|x0:k-1, y1:k-1) p(x0:k-1|y1:k-1) dx0:k-1] / p(yk|y1:k-1)
= p(yk|y1:k-1, xk) [∫ p(xk|xk-1, y1:k-1) p(xk-1|y1:k-1) dxk-1] / p(yk|y1:k-1). (2.5)

Note that although (2.5) does admit a recursive form, it cannot be calculated recursively in general.
The reason is that the terms p(yk|y1:k-1, xk) and p(xk|xk-1, y1:k-1) are conditioned on the entire sequence of observations y1:k-1 and thus require storing the sequence y1:k-1.¹ Therefore, to make (2.5) a proper recursion, additional restrictions have to be imposed. The first restriction is that, given the current state xk, the current observation yk is conditionally independent of all previous observations y1:k-1:

p(yk|y1:k-1, xk) = p(yk|xk). (2.6)

The second restriction is that, given the state xk-1 from the previous time-step (k-1), the current state xk is conditionally independent of all previous observations y1:k-1,

p(xk|xk-1, y1:k-1) = p(xk|xk-1), (2.7)

which means that the state sequence is a first-order Markov process. In the filtering literature, p(yk|xk) and p(xk|xk-1) are commonly referred to as the likelihood function² and the transition distribution, respectively. Using the restrictions (2.6) and (2.7) we can now rewrite (2.5) as a proper recursion

p(xk|y1:k) = p(yk|xk) [∫ p(xk|xk-1) p(xk-1|y1:k-1) dxk-1] / p(yk|y1:k-1). (2.8)

Equation (2.8) is the well-known recursive Bayes filter and constitutes a central equation for many probabilistic schemes for tracking and estimation in stochastic dynamic systems. In the following we describe how different terms in the recursion (2.8) conceptually relate to the class of stochastic dynamic systems which we consider in this thesis.

¹Note that the normalization p(yk|y1:k-1) in (2.5) also depends on the sequence y1:k-1; however, it is not troublesome, since it is a constant and can be calculated by integrating the numerator, p(yk|y1:k-1) = ∫ p(yk|xk, y1:k-1) p(xk|y1:k-1) dxk.
2.3 Bayes filter for a stochastic dynamic system

The stochastic dynamic system is defined by a set of, possibly nonlinear, system equations

xk = f(xk-1, vk), (2.9)
yk = g(xk, nk), (2.10)

where (2.9) is the process evolution model and (2.10) is the measurement process model. According to (2.9) and (2.10), the state xk-1 evolves through a system transition function f(xk-1, vk) which is driven by the process noise vk. The hidden state xk is then observed through the observation function g(xk, nk), where nk is the observation noise. An equivalent probabilistic model of the dynamic system (2.9, 2.10) is shown in Figure 2.1 as a graphical model. The transition density p(xk|xk-1) is completely defined by the transition function f(xk-1, vk) and the process noise distribution p(vk), while the likelihood function p(yk|xk) is specified by the observation function g(xk, nk) and the measurement noise distribution p(nk).

[Figure 2.1: A graphical model of the stochastic dynamic system with the unobserved states xk and the observed measurements yk. The hidden state xk of the system evolves with time as a partially observed first-order Markov process according to the conditional probability density p(xk|xk-1). The observations yk are conditionally independent given the state and are generated according to the probability density function p(yk|xk).]

In the literature, the recursion of the Bayes filter (2.8) is usually broken into two steps: prediction and update.

²Note that p(yk|xk) is the probability of observing the measurement yk, given that the system is in the state xk. However, p(yk|xk) is also the likelihood of the system being in state xk, given the observation yk; this is the reason why p(yk|xk) is often referred to simply as the likelihood function in filtering theory.
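Before detailing the two steps, the generative system (2.9, 2.10) itself can be illustrated with a short simulation. This is our own sketch, not part of the thesis; the linear choices of f and g and the noise scales are arbitrary, for illustration only:

```python
import numpy as np

def simulate_system(f, g, x0, steps, rng):
    """Draw a state/measurement sequence from the dynamic system
    xk = f(xk-1, vk), yk = g(xk, nk), with scalar Gaussian noise vk, nk."""
    xs, ys = [], []
    x = x0
    for _ in range(steps):
        x = f(x, rng.normal())   # state evolution (2.9)
        y = g(x, rng.normal())   # measurement process (2.10)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Example: a nearly constant state with small process noise,
# observed through larger additive measurement noise.
rng = np.random.default_rng(0)
xs, ys = simulate_system(lambda x, v: x + 0.1 * v,
                         lambda x, n: x + 0.5 * n,
                         x0=0.0, steps=100, rng=rng)
```

The filtering problem discussed in this chapter is exactly the inverse task: recovering the hidden sequence xs from the noisy observations ys.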
In the prediction step, the posterior p(xk-1|y1:k-1) from the previous time-step (k-1) is propagated through the dynamic model p(xk|xk-1) to yield the predictive distribution p(xk|y1:k-1) using the Chapman-Kolmogorov relation:

p(xk|y1:k-1) = ∫ p(xk|xk-1) p(xk-1|y1:k-1) dxk-1. (2.11)

Note that (2.11) is just the integral from the right-hand side of the Bayes recursion (2.8). In the update step, the predictive distribution is updated using the likelihood p(yk|xk) associated with the observed measurement yk, and normalized to yield the new posterior p(xk|y1:k),

p(xk|y1:k) = p(yk|xk) p(xk|y1:k-1) / p(yk|y1:k-1), (2.12)

where the normalization function is calculated by integrating the numerator over all values of xk, i.e., p(yk|y1:k-1) = ∫ p(yk|xk) p(xk|y1:k-1) dxk. At first glance, the calculation of the recursion (2.11, 2.12) may appear a straightforward matter. Unfortunately, it is merely conceptual, and analytic solutions exist only for a handful of systems. The reason is that the integrations involved in calculating p(xk|y1:k-1) and p(yk|y1:k-1) generally do not have closed-form solutions. Furthermore, even if the posterior p(xk|y1:k) can be found, estimates of the target’s state, such as the MMSE, are likely to be intractable [57]. A number of approaches have been proposed in the literature to make the recursions (2.11, 2.12) tractable. We give a brief overview of the better known ones next.

2.4 Historical approaches to recursive filtering

When the measurement and the system models are linear and the noise in both models is Gaussian, closed-form solutions of the integrals in the recursions of the Bayes filter exist. In particular, if we start from a Gaussian prior p(x0), and assume that the prediction step (2.11) and the update step (2.12) of the recursive Bayes filter are linear operations on Gaussians, then the resulting posterior is a Gaussian as well.
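In this linear-Gaussian case, the prediction (2.11) and update (2.12) reduce to closed-form recursions on the mean m and covariance P of the Gaussian posterior. A minimal sketch, assuming the standard model xk = F xk-1 + vk, yk = H xk + nk with vk ~ N(0, Q) and nk ~ N(0, R):

```python
import numpy as np

def kalman_predict(m, P, F, Q):
    """Prediction step: propagate the Gaussian posterior through the dynamics."""
    return F @ m, F @ P @ F.T + Q

def kalman_update(m, P, y, H, R):
    """Update step: correct the prediction with the measurement y."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    m_new = m + K @ (y - H @ m)
    P_new = (np.eye(len(m)) - K @ H) @ P
    return m_new, P_new
```

One tracking iteration is a predict followed by an update; note that the update always shrinks the covariance, reflecting the information gained from the measurement.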
Thus the recursions (2.11, 2.12) can be interpreted as a recursive estimation of the mean and covariance of the posterior p(xk|y1:k). The filter that emerges from this is the well-known Kalman filter [76], which was originally derived by R. E. Kalman in the early sixties. Since then, it has been applied extensively to various filtering problems and many variants have been presented; see, e.g., [75, 133, 73, 67, 126]. In practice, the use of the Kalman filter is limited by the nonlinearity and the non-Gaussian nature of the physical world. As an attempt to deal with these nonlinearities, a modification called the extended Kalman filter (EKF) was proposed (see, e.g., [67, 5]). This filter locally linearizes the system transition and measurement models and then applies the standard Kalman filter to analytically calculate the posterior p(xk|y1:k). Usually, a first-order Taylor series expansion is used for the linearization. To obtain better approximations, higher-order variants of the EKF have been proposed [155]. However, the additional complexity that these higher-order extensions entail has prevented their widespread use [6]. There are two sources of inaccuracy in the EKF. The first is the linearization of the system and measurement models, which is carried out only at a single point and does not take into account the uncertainty associated with that point. The second is the assumption that the prior and posterior are Gaussians. In fact, the nonlinearities in the models will result in posteriors being non-Gaussian, and sometimes even multi-modal, after the propagation step. This undermines the Gaussian assumption in the EKF and consequently results in degraded performance or even in complete failure of the tracker. As an attempt to better deal with the nonlinearities of the measurement and system transition models, the so-called unscented Kalman filter (UKF) was proposed by Julier and Uhlmann [69, 71, 70, 72] in the late nineties.
The basic idea behind the UKF is to avoid the linearization of the measurement process and the system dynamics altogether. The posterior is, as in the EKF, still approximated by a Gaussian, but this Gaussian is encoded through a set of points called sigma points. These points are carefully chosen along the principal axes of the Gaussian such that they capture its mean and covariance. In the propagation step, the points are propagated through the true nonlinear system and capture the mean and the covariance of the new posterior accurately up to the third order (in terms of the Taylor expansion) [69]. This is a significant improvement in comparison with the standard EKF, whose approximation is accurate only up to the first order. Furthermore, the computational complexity of the UKF is lower than that of the EKF, since it does not require linearization. Separately from the UKF, another filter, called the central-difference Kalman filter (CDKF), has been proposed [79, 117]. Van der Merwe [161, 159] observed that the UKF and CDKF algorithms are related through an implicit use of a technique called weighted statistical linear regression [94], which is responsible for determining the locations as well as the weights of the sigma points. These filters have therefore been given a common name: sigma-point Kalman filters (SPKF) [159]. Although the SPKF is more accurate than the EKF, it still assumes that the posterior can be approximated by a single Gaussian. As such, it is still prone to failure in certain nonlinear, non-Gaussian problems which involve multi-modal and/or heavy-tailed posterior distributions. As an alternative to approximating the posterior using only a single Gaussian, the Gaussian sum filter (GSF) was proposed [150, 4], which applies a mixture of Gaussians to better model the multimodality of the posterior distribution.
The motivation behind this was the fact that any non-Gaussian density can be approximated sufficiently well by a sufficiently large number of Gaussians [165]. Thus the GSF was conceptually presented as a bank of extended Kalman filters, where at each time-step each component of the Gaussian mixture was propagated using an EKF. One drawback of the filter was that the number of components in the mixture had to be specified in advance. Another drawback was that in cases where the observation noise and/or the system noise were also approximated by a Gaussian mixture model, the number of components in the posterior increased exponentially after each tracking iteration, and methods for reducing the number of components had to be used. Furthermore, since the GSF used the EKF to propagate the components of the mixture, it inherently suffered from the problem of the first-order linearization in the EKF.

In contrast to the parametric models, which use Gaussians or Gaussian mixtures to calculate the recursions in the Bayes filter analytically, non-parametric methods, called grid-based methods, have been proposed to calculate the recursions of the Bayes filter numerically. These approaches are based on discretizing the continuous state-space into a predefined set of cells and then evaluating the posterior only at these cells. In some variations, the posterior is approximated using a discrete distribution over the grid cells [149]. Other approaches use splines [27], step functions [86], or apply quadrature methods [166]. The advantage of these approaches is that they simplify the recursions of the Bayes filter, as each point on the grid is updated independently of the others. Therefore, given enough cells, they hold the potential of approximating fairly complex distributions.
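As a minimal illustration of the grid-based idea, the Bayes recursion (2.11)-(2.12) can be evaluated numerically on a one-dimensional grid. The random-walk dynamic model, the Gaussian likelihood and all numeric settings below are illustrative stand-ins, not models from the thesis.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-5.0, 5.0, 401)          # discretized 1-D state-space (grid cells)
dx = x[1] - x[0]
posterior = gauss(x, 0.0, 1.0)           # p(x_{k-1} | y_{1:k-1}) on the grid
posterior /= posterior.sum() * dx

# Prediction step (2.11): numerically integrate p(x_k|x_{k-1}) p(x_{k-1}|y_{1:k-1})
trans = gauss(x[:, None], x[None, :], 0.5)   # transition density, random walk
predictive = trans @ posterior * dx

# Update step (2.12): multiply by the likelihood p(y_k|x_k) and normalize
y_k = 1.2                                 # an illustrative measurement
likelihood = gauss(y_k, x, 0.8)
posterior_k = likelihood * predictive
posterior_k /= posterior_k.sum() * dx     # division by p(y_k | y_{1:k-1})

mmse = (x * posterior_k).sum() * dx       # posterior mean on the grid
```

Note that each grid cell of the predictive distribution is computed independently of the others, which is exactly the simplification the grid-based methods exploit.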
A significant drawback of the grid-based methods is, however, that they require specifying the number of cells in advance, and the grid has to be sufficiently dense to allow a good approximation of the posterior. With increasing dimension of the state-space, the computational burden thus increases dramatically. Another drawback is that, because the state of the target moves through the continuous state-space, the grid also has to move along with the target, and has to scale accordingly to cover the significant probability mass corresponding to the target's current state. This is a nontrivial task which further increases the computational burden of the method.

In the late eighties and early nineties, with the advent of increased computational power in computers, significant steps were made toward calculating the recursions of the recursive Bayes filter by means of simulation [83, 82]. As a result, a number of methods were independently developed in fields such as statistics, econometrics, engineering and computer science that were based on evaluating the integrals of the Bayesian recursion by Monte Carlo sampling. These methods are commonly known as sequential Monte Carlo methods or particle filters and are described next.

2.5 Monte-Carlo-based recursive filtering

The aim of recursive Bayesian filtering is to calculate the posterior distribution $p(x_k|y_{1:k})$ of the target's state $x_k$ given the observations $y_{1:k}$. Once the posterior is known, various estimates of the target's state can be calculated. For example, the minimum-mean-squared-error (MMSE) estimate of the target's state can be obtained by taking the expectation $\hat{x}_k = \int x_k\,p(x_k|y_{1:k})\,dx_k$. However, the MMSE estimate involves solving a high-dimensional integral which is often intractable. A common approach to evaluating intractable high-dimensional integrals is to apply Monte Carlo integration, where the integrals are replaced by summations over discrete sets of points.
In this section we will first briefly overview the theoretical background of Monte Carlo integration and show how it is used to approximate the recursive Bayes filter. We will point out some drawbacks and implementation issues of this approximation and discuss solutions and simplifications proposed in the literature. In light of these, we will finally present a well-known approximate recursive Bayes filter called the particle filter.

2.5.1 Perfect Monte Carlo sampling

First we will discuss how Monte Carlo methods can be conceptually used to approximate integrals. For the sake of clarity we will drop the subscripts $(\cdot)_k$, which indicate the time-steps, for now and reintroduce them later, when we consider the application of these methods to Bayesian filtering. Note that in tracking we are generally interested in calculating expectations over some posterior $p(x|y)$, which are integrals of the type

$I(f) = \int f(x)\,p(x|y)\,dx$,

where $f(\cdot)$ is a $p(x|y)$-integrable function of $x$. In perfect Monte Carlo sampling, the integral $I(f)$ can be approximated as follows. First, $N$ independent samples $\{x^{(i)}\}_{i=1\ldots N}$ are drawn from the posterior $p(x|y)$. Using these samples, the posterior can be approximated by the following empirical estimate

$p_N(x|y) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x)$,  (2.13)

which is essentially a set of Dirac-delta functions $\delta_{x^{(i)}}(x)$ located at the sampled points $x^{(i)}$. Replacing $p(x|y)$ with its empirical estimate $p_N(x|y)$, the integral $I(f)$ can now be numerically approximated by $I_N(f)$ using Monte Carlo integration:

$I_N(f) = \int f(x)\,p_N(x|y)\,dx = \frac{1}{N}\sum_{i=1}^{N}\int f(x)\,\delta_{x^{(i)}}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$.  (2.14)

The validity of the approximation $I(f) \approx I_N(f)$ is guaranteed by the strong law of large numbers ([43], page 7), which states that the average of many independent random numbers with a common mean and finite variance converges to the common mean, $\lim_{N\to\infty} I_N(f) = I(f)$, with probability one.
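The estimate (2.14) can be demonstrated in a few lines. As a sketch, take $p(x|y)$ to be a standard normal and $f(x) = x^2$, so that the true value of the integral is the variance, $I(f) = 1$; both choices are illustrative.

```python
import numpy as np

# Perfect Monte Carlo estimate (2.14): draw i.i.d. samples from the target
# density and average f over them. Here p(x|y) = N(0,1) and f(x) = x^2.
rng = np.random.default_rng(0)
N = 200_000
samples = rng.standard_normal(N)   # i.i.d. draws x^(i) ~ p(x|y)
I_N = np.mean(samples ** 2)        # (1/N) sum_i f(x^(i)), true value is 1
```

With $N = 200{,}000$ the estimate is typically within a fraction of a percent of the true value, consistent with the $1/\sqrt{N}$ convergence rate discussed next.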
Moreover, if the variance $\sigma_f^2$ of $f(x)$ with respect to $p(x|y)$ is finite, the central limit theorem tells us that, as $N$ increases, the scaled difference between the integral $I(f)$ and its approximation $I_N(f)$ converges to a normal distribution,

$\sqrt{N}\,[I_N(f) - I(f)] \underset{N\to\infty}{\Longrightarrow} \mathcal{N}(0, \sigma_f^2)$,  (2.15)

where $\Rightarrow$ denotes convergence in distribution ([43], page 7). From (2.15) we see that the accuracy of the estimator $I_N(f)$ increases with the number of samples and does not depend directly on the dimension of the integrand³. In contrast, in a deterministic numerical integration the accuracy of the integral decreases as the dimension of the integrand increases ([43], page 7).

The above discussion tells us that if we are able to provide independent samples from $p(x|y)$, then integrals of the type $I(f)$ can be easily approximated. In practice, however, $p(x|y)$ will be multivariate, nonstandard and known only up to a proportionality constant. Thus sampling from $p(x|y)$ directly is usually impossible and alternative solutions are required. One common solution is to use importance sampling.

³Note, however, that $\sigma_f^2$ in (2.15) may still grow appreciably with the dimension of the integrand [57].

2.5.2 Importance sampling

Let $p(x|y)$ be some distribution which is difficult to sample from, but can be evaluated up to a proportionality constant as

$\tilde{p}(x|y) = C_p\,p(x|y)$.  (2.16)

Let $q(x|y)$ be another distribution which is easy to sample from, can also be evaluated point-wise, and has the same support as $p(x|y)$⁴. In the literature, $q(x|y)$ is usually called the importance function or the proposal distribution. With the above definitions, the integral $I(f)$ can be rewritten as

$I(f) = \int f(x)\,p(x|y)\,dx = \frac{1}{C_p}\int f(x)\,\frac{\tilde{p}(x|y)}{q(x|y)}\,q(x|y)\,dx = \frac{1}{C_p}\int f(x)\,q(x|y)\,w(x)\,dx$,

where $\tilde{p}(x|y)$ was absorbed into the importance weight

$w(x) = \frac{\tilde{p}(x|y)}{q(x|y)}$.  (2.17)

The proportionality constant $C_p$ is obtained by integrating both sides of (2.16), $C_p = \int \tilde{p}(x|y)\,dx = \int q(x|y)\,w(x)\,dx$, and $I(f)$ can be rewritten as

$I(f) = \frac{1}{C_p}\int f(x)\,q(x|y)\,w(x)\,dx$.  (2.18)
Since $q(x|y)$ can be easily sampled, it can be approximated by an empirical Monte Carlo estimate analogous to (2.13),

$q_N(x|y) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x)$,  (2.19)

where $\{x^{(i)}\}_{i=1:N}$ is a set of independent and identically distributed (i.i.d.) samples from $q(x|y)$. The integral $I(f)$ (2.18) can now be approximated by $I_N(f)$, where

$I_N(f) = \frac{1}{C_p^N}\int f(x)\,q_N(x|y)\,w(x)\,dx = \frac{\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})\,w(x^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} w(x^{(i)})}$.

By cancelling the terms $\frac{1}{N}$ in the above equation and taking the normalization into the sum, we can further rewrite the integral as

$I_N(f) = \sum_{i=1}^{N} f(x^{(i)})\,\tilde{w}^{(i)}$,   $\tilde{w}^{(i)} = \frac{w(x^{(i)})}{\sum_{j=1}^{N} w(x^{(j)})}$.  (2.20)

It is beneficial to note that, in terms of a perfect Monte Carlo sampling, the integral $I_N(f)$ can be viewed as an expectation of $f(x)$ under the following empirical distribution

$\tilde{p}_N(x|y) = \sum_{i=1}^{N}\tilde{w}^{(i)}\,\delta_{x^{(i)}}(x)$.

⁴We say that $q(x|y)$ has the same support as $p(x|y)$ when it is nonzero at least for all those $x$ for which $p(x|y)$ is nonzero as well.

Returning to Bayesian filtering, we reintroduce the time-step subscripts. Applying importance sampling sequentially, at each time-step, to the posterior over the state history yields a recursion for the importance weights whose proposal distribution is, in general, of the form $q(x_k|x_{0:k-1}, y_{1:k})$⁵. Note that this is not a proper recursion, since the proposal distribution is conditioned on $y_{1:k}$, and thus requires storing the entire sequence of past observations. To avoid that, the proposal is usually modified [6] into

$q(x_k|x_{0:k-1}, y_{1:k}) = q(x_k|x_{k-1}, y_k)$.  (2.25)

With this definition of the importance function, the equation for calculating the nonnormalized importance weights⁶ $\tilde{w}_k^{(i)}$ takes the following form

$\tilde{w}_k^{(i)} = w_{k-1}^{(i)}\,\frac{p(y_k|x_k^{(i)})\,p(x_k^{(i)}|x_{k-1}^{(i)})}{q(x_k^{(i)}|x_{k-1}^{(i)}, y_k)}$,  (2.26)

and the posterior $p(x_k|y_{1:k})$ in the current time-step is approximated by the following random measure

$p_N(x_k|y_{1:k}) = \sum_{i=1}^{N} w_k^{(i)}\,\delta_{x_k^{(i)}}(x_k)$,  (2.27)

where $w_k^{(i)}$ are now normalized importance weights, i.e., $w_k^{(i)} = \tilde{w}_k^{(i)}\,(\sum_{j=1}^{N}\tilde{w}_k^{(j)})^{-1}$.

⁵We have dropped the term $p(y_k|y_{1:k-1})$ from the denominator of the posterior, since it is independent of the sequence of states $x_{0:k}^{(i)}$ and is constant at $k$.
⁶By nonnormalized importance weights we refer to a set of weights which do not sum to one.
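The self-normalized estimate (2.20) can be sketched on a static example. The target below is an unnormalized standard normal (so $C_p$ is unknown to the estimator), the proposal is a wider normal $\mathcal{N}(0, 2^2)$, and $f(x) = x^2$, so the true value is again 1; all of these are illustrative choices.

```python
import numpy as np

# Self-normalized importance sampling (2.20): samples come from the proposal
# q, and each is weighted by w(x) = p_tilde(x) / q(x) as in (2.17).
rng = np.random.default_rng(1)
N = 200_000

p_tilde = lambda x: np.exp(-0.5 * x**2)                        # p up to C_p
q_pdf = lambda x: np.exp(-0.5 * (x / 2)**2) / (2 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, N)        # i.i.d. samples from the proposal q
w = p_tilde(x) / q_pdf(x)          # unnormalized importance weights (2.17)
w_tilde = w / w.sum()              # normalized weights from (2.20)
I_N = np.sum(x**2 * w_tilde)       # weighted estimate of E_p[x^2], true value 1
```

The unknown constant $C_p$ cancels in the normalization, which is precisely why the weights only need to be known up to scale.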
An approximate recursive Bayesian filter that follows from (2.27) and (2.26) is called the Sequential Importance Sampling (SIS) algorithm and is summarized in Algorithm 2.1.

Input:
• Posterior from the previous time-step: $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$
Output:
• Posterior from the current time-step: $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$

1. For $i = 1 : N$,
   • Sample a new particle: $x_k^{(i)} \sim q(x_k|x_{k-1}^{(i)}, y_k)$.
   • Assign a weight: $\tilde{w}_k^{(i)} = w_{k-1}^{(i)}\,\frac{p(y_k|x_k^{(i)})\,p(x_k^{(i)}|x_{k-1}^{(i)})}{q(x_k^{(i)}|x_{k-1}^{(i)}, y_k)}$.
2. For $i = 1 : N$,
   • Normalize the weights: $w_k^{(i)} = \tilde{w}_k^{(i)} / \sum_{j=1}^{N}\tilde{w}_k^{(j)}$.
3. The new random measure is an empirical approximation to the true posterior: $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N} \approx p(x_k|y_{1:k})$.

Algorithm 2.1: Sequential Importance Sampling (SIS) algorithm.

2.5.4 Degeneracy of the SIS algorithm

The SIS algorithm has a major practical drawback. Since we recursively multiply the weights from the previous time-step with the weights in the current time-step, the variance of the weights tends to increase with time [43], and thus the approximation of the posterior deteriorates. Furthermore, any estimate based on this approximation (e.g., the MMSE estimate) deteriorates as well. The authors of [6] have reported that in practice, after a few iterations, most of the normalized weights will be close to zero. From a computational standpoint this means that most calculations are expended on weights whose contribution to the estimates is negligible. This effect is commonly known in the literature as the problem of degeneracy and is usually tackled by the following two approaches [43, 6, 57]:

1. Choice of a suitable importance function.
2. Application of resampling.

Choice of the importance function

In view of alleviating the degeneracy of weights in SIS, the optimal choice of the importance function is the one which minimizes the variance of the true weights, conditioned on the previous state and the current measurement [6]. It has been shown in [43] that in terms of variance minimization, the optimal importance function is $q(x_k|x_{k-1}, y_k) = p(x_k|x_{k-1}, y_k)$.
There are two cases in which such an importance function can be used [6]. One is when the state-space of $x_k$ is discrete and finite. The other is when the measurement model is linear and the noise in the system and measurement models is Gaussian. Unfortunately, as we will see in later sections, these restrictions do not apply to our applications of visual tracking. Various authors have proposed methods that use other, suboptimal, importance functions. Among them are the auxiliary particle filter [129], the Gaussian mixture sigma-point particle filter [160], the annealed particle filter [41], partitioned sampling [107] and layered sampling [124], to name but a few.

An alternative choice of the importance function is the prior transition model

$q(x_k|x_{k-1}, y_k) = p(x_k|x_{k-1})$.  (2.28)

What is appealing about this choice is that it only requires us to sample from the system dynamic model, which is often easy. Furthermore, the weight update equation (2.26) simplifies to

$\tilde{w}_k^{(i)} = w_{k-1}^{(i)}\,p(y_k|x_k^{(i)})$.  (2.29)

However, choosing the prior for the importance function may also have a deteriorative effect on the performance of SIS. This happens in situations when the likelihood $p(y_k|x_k)$ is quite peaked in comparison to the prior $p(x_k|x_{k-1})$. An example of such a situation is illustrated in Figure 2.2. Only a few particles generated from the prior are located at the peak of the likelihood function. As a result, the majority of the particles have weights very close to zero. Note that this is in fact equivalent to the problem of degeneracy which we are trying to avoid. Nevertheless, the simplicity of implementation and weight calculation that such a choice of importance function offers makes it a very popular choice among practitioners.

Figure 2.2: Samples (depicted by circles) are drawn from the prior $p(x_k|x_{k-1})$. A weight is assigned to each sample according to the likelihood function $p(y_k|x_k)$.
The weight of each sample is indicated by the radius of the circle; a large radius corresponds to a large weight, while a small radius corresponds to a small weight. Note that since only a few samples are generated at the peak of the likelihood function, the majority of samples have weights close to zero.

Application of resampling

The second approach to alleviating the effects of degeneracy is to use resampling whenever a significant degeneracy is observed. Recall that the posterior $p(x_k|y_{1:k})$ is approximated by the following random measure

$p_N(x_k|y_{1:k}) = \sum_{i=1}^{N} w_k^{(i)}\,\delta_{x_k^{(i)}}(x_k)$.  (2.30)

Resampling avoids the degeneracy of weights by selecting the particles with high weights multiple times and discarding those with smaller weights, thus generating a new random measure

$p_N(x_k|y_{1:k}) = \sum_{i=1}^{N} \frac{1}{N}\,\delta_{\tilde{x}_k^{(i)}}(x_k)$,

under which all new particles $\tilde{x}_k^{(i)}$ have equal weights $\frac{1}{N}$ and are still approximately distributed according to the original posterior $p(x_k|y_{1:k})$. In this way, the particles are herded into regions of high probability density of the posterior $p(x_k|y_{1:k})$. Resampling is required especially in those situations when the particle set becomes degenerated. In the literature [101, 13, 6], an effective sample size $N_{eff}$, sometimes also called the survival diagnostic [106], is proposed as a measure of degeneracy:

$N_{eff} = \frac{1}{\sum_{i=1}^{N}(w_k^{(i)})^2}$.  (2.31)

When $N_{eff}$ falls below a predefined threshold, the particle set is resampled. Many resampling schemes have been proposed in the literature: stratified sampling and residual sampling [101], systematic resampling [83], deterministic resampling [106], resampling based on ordered statistics [134, 31], regularized sampling [47], etc. We use deterministic resampling [106] in our implementation, since it is simple to implement and the complexity of the algorithm is O(N). The deterministic resampling is summarized in Algorithm 2.2.
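The degeneracy diagnostic (2.31) and a deterministic resampler in the spirit of Algorithm 2.2 can be sketched as follows; the vectorized search below is an assumption of this sketch (it replaces the explicit while-loop with an equivalent binary search over the cumulative weights).

```python
import numpy as np

def effective_sample_size(w):
    # N_eff of (2.31): N for uniform weights, 1 when one particle carries all mass
    return 1.0 / np.sum(w ** 2)

def deterministic_resample(particles, w):
    # Place the points i/N, i = 1..N, into the cumulative distribution of the
    # weights; high-weight particles are copied several times, low-weight ones
    # are dropped, and all new weights are set to 1/N.
    N = len(w)
    c = np.cumsum(w)
    positions = np.arange(1, N + 1) / N
    idx = np.minimum(np.searchsorted(c, positions), N - 1)  # clamp fp round-off
    return particles[idx], np.full(N, 1.0 / N)

w = np.array([0.1, 0.1, 0.7, 0.1])
n_eff = effective_sample_size(w)                    # close to 2: degenerate set
new_particles, new_w = deterministic_resample(np.array([0.0, 1.0, 2.0, 3.0]), w)
```

On this toy weight vector the dominant particle (weight 0.7) is copied three times, while the two lightest particles disappear.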
It is important to note that while resampling does alleviate the problem of degeneracy, it introduces other problems. One is that particles with higher weights are chosen multiple times in the resampling step. This can reduce the diversity of the particles, since the new particle set contains multiple copies of the same particles. This is known in the literature as sample impoverishment and can be critical especially when the system noise is small [6]. In those situations, after a few iterations, the entire particle set can collapse into a single point [6]. Nevertheless, despite the drawback of possible sample impoverishment, resampling deals well with the problem of degeneracy and has as such become an integral part of all modern particle-set-based recursive Bayesian filters.

Input: Posterior $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$
Output: Resampled posterior $p(x_k|y_{1:k}) \approx \{\tilde{x}_k^{(i)}, \tilde{w}_k^{(i)}\}_{i=1}^{N}$

1. Generate a cumulative distribution $\{c^{(i)}\}_{i=1}^{N}$ from the particle weights: $c^{(i)} = \sum_{j=1}^{i} w_k^{(j)}$.
2. Initialize: $l = 1$.
3. For $i = 1 : N$,
   • while $(\frac{i}{N} > c^{(l)})$: $l = l + 1$.
   • choose $\tilde{x}_k^{(i)} = x_k^{(l)}$ and set $\tilde{w}_k^{(i)} = \frac{1}{N}$.

Algorithm 2.2: Deterministic resampling.

2.5.5 Particle filters

If resampling (e.g., Algorithm 2.2) is used with the SIS algorithm (Algorithm 2.1) and the effective sample size $N_{eff}$ (2.31) is used to decide when to resample, we obtain the so-called generic particle filter, which is summarized in Algorithm 2.3. A special variant of the generic particle filter has become popular and widely used for visual target tracking in the last decade. This variant makes the following two simplifications:

• The prior is used as the importance function: $q(x_k|x_{k-1}, y_k) = p(x_k|x_{k-1})$.
• Resampling is executed at each iteration (this is equivalent to setting $N_{thres} = \infty$ in Algorithm 2.3).
The variant that uses the above two simplifications has emerged under various names, such as the bootstrap particle filter (BPF) [53] (published in 1993) and the Condensation algorithm [63] (published in 1996), to name just the two more visible ones.

Input:
• Posterior from the previous time-step: $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$
Output:
• Posterior from the current time-step: $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$

1. Evaluate $N_{eff}$ using (2.31).
2. If $N_{eff} < N_{thres}$, resample $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$ using Algorithm 2.2: $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N} \to \{\tilde{x}_{k-1}^{(i)}, \frac{1}{N}\}_{i=1}^{N}$.
3. For $i = 1 : N$,
   • Sample a new particle: $x_k^{(i)} \sim q(x_k|\tilde{x}_{k-1}^{(i)}, y_k)$.
   • Assign a new weight: $\tilde{w}_k^{(i)} = \frac{p(y_k|x_k^{(i)})\,p(x_k^{(i)}|\tilde{x}_{k-1}^{(i)})}{q(x_k^{(i)}|\tilde{x}_{k-1}^{(i)}, y_k)}$.
4. For $i = 1 : N$,
   • Normalize the weights: $w_k^{(i)} = \tilde{w}_k^{(i)} / \sum_{j=1}^{N}\tilde{w}_k^{(j)}$.
5. The new random measure is an empirical approximation to the posterior: $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N} \approx p(x_k|y_{1:k})$.

Algorithm 2.3: A generic particle filter.

The bootstrap particle filter is summarized in Algorithm 2.4. Note that since the particle filter approximates the posterior over the target's state by a weighted sample set, $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$, the minimum-mean-squared-error (MMSE) estimate $\hat{x}_k$ is approximated as

$\hat{x}_k = \sum_{i=1}^{N} x_k^{(i)}\,w_k^{(i)}$.  (2.32)

Input:
• Posterior from the previous time-step: $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$
Output:
• Posterior from the current time-step: $p(x_k|y_{1:k}) \approx \{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$

1. Resample $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$ using Algorithm 2.2: $\{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N} \to \{\tilde{x}_{k-1}^{(i)}, \frac{1}{N}\}_{i=1}^{N}$.
2. For $i = 1 : N$,
   • Predict by sampling a new particle: $x_k^{(i)} \sim p(x_k|\tilde{x}_{k-1}^{(i)})$.
   • Update by assigning a new weight: $\tilde{w}_k^{(i)} = p(y_k|x_k^{(i)})$.
3. For $i = 1 : N$,
   • Normalize the weights: $w_k^{(i)} = \tilde{w}_k^{(i)} / \sum_{j=1}^{N}\tilde{w}_k^{(j)}$.
4. The new random measure is an empirical approximation to the true posterior: $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N} \approx p(x_k|y_{1:k})$.

Algorithm 2.4: Bootstrap particle filter.

All trackers which will be developed in the following chapters are based on the bootstrap particle filter.
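The resample-predict-update cycle of the bootstrap filter can be condensed into a short sketch. The 1-D random-walk dynamics, the Gaussian likelihood and all numeric settings below are illustrative toy choices, and multinomial resampling stands in for Algorithm 2.2.

```python
import numpy as np

# A minimal bootstrap particle filter (Algorithm 2.4) for a 1-D toy model:
# p(x_k|x_{k-1}) = N(x_{k-1}, q^2) and p(y_k|x_k) = N(x_k, r^2).
rng = np.random.default_rng(2)
N, q, r = 2000, 0.3, 0.5

def bpf_step(particles, weights, y):
    # 1. Resample according to the weights (multinomial stand-in)
    particles = particles[rng.choice(N, size=N, p=weights)]
    # 2. Predict: simulate each particle through the dynamic model
    particles = particles + rng.normal(0.0, q, N)
    # 3. Update: weight by the likelihood p(y_k|x_k) and normalize
    w = np.exp(-0.5 * ((y - particles) / r) ** 2)
    return particles, w / w.sum()

particles = rng.normal(0.0, 1.0, N)       # samples from the prior p(x_0)
weights = np.full(N, 1.0 / N)
for y_k in [0.2, 0.5, 0.9, 1.1]:          # a short synthetic measurement sequence
    particles, weights = bpf_step(particles, weights, y_k)

x_mmse = np.sum(particles * weights)      # MMSE estimate (2.32)
```

After four measurements drifting toward 1.1, the weighted mean settles between the predicted and the measured position, as the posterior mean should.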
We therefore conclude this chapter with a detailed illustration of a BPF iteration on an example of one-dimensional tracking.

Example 1. Consider the problem of estimating the horizontal position of the frog in Figure 2.3 from time-step $(k-1)$ to time-step $k$. At time-step $k$ we know an empirical estimate of the posterior over the horizontal position from the previous time-step $(k-1)$ in the form of a particle set $p(x_{k-1}|y_{1:k-1}) \approx \{x_{k-1}^{(i)}, w_{k-1}^{(i)}\}_{i=1}^{N}$ (Figure 2.3a). We wish to calculate an approximation to the posterior in the current time-step $k$.

Figure 2.3: Example of estimating the horizontal position of the frog in two consecutive images (a) and (b). The posterior $p(x_{k-1}|y_{1:k-1})$ at $(k-1)$ is approximated by particles, which are depicted by circles below the frog. The aim of tracking is to estimate the posterior in the current time-step $k$, as the frog changes its position.

The main steps of the BPF iteration are illustrated in Figure 2.4 and are listed as follows:

• We start with the posterior $p(x_{k-1}|y_{1:k-1})$ from the previous time-step $(k-1)$ (Figure 2.4, stage 0).
• First, all particles are resampled by the deterministic resampling (Algorithm 2.2). The resulting empirical distribution $\{\tilde{x}_{k-1}^{(i)}, \frac{1}{N}\}_{i=1}^{N}$ is still an approximation to $p(x_{k-1}|y_{1:k-1})$. The particles that originally had higher weights are multiplied many times, while those with small weights are discarded (Figure 2.4, stage 1).
• Then each particle is simulated according to the system's dynamic model $p(x_k|x_{k-1})$. The resulting random measure $\{x_k^{(i)}, \frac{1}{N}\}_{i=1}^{N}$ is an approximation to $p(x_k|y_{1:k-1})$, which is the prediction of the posterior from the previous time-step (Figure 2.4, stage 2).
• Finally, each particle is assigned a weight according to the likelihood function $p(y_k|x_k)$. All weights are normalized such that they sum to one, and the resulting particle set $\{x_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$ is an empirical approximation to the posterior $p(x_k|y_{1:k})$ (Figure 2.4, stage 3).
The average position (the MMSE estimate of the state) $\hat{x}_k$ can be calculated from the approximation to $p(x_k|y_{1:k})$ and is depicted in Figure 2.4 (left column, last row) by an orange circle.

Figure 2.4: An illustration of a single iteration of the bootstrap particle filter (Algorithm 2.4) with four stages from Example 1. The posterior over the frog's horizontal position (stage 0) is presented by twelve particles, where each particle is depicted by a purple circle and the weight of the particle is indicated by the circle's radius: the larger the radius, the larger the weight. The three subsequent stages are resampling (stage 1), prediction (stage 2) and update (stage 3). The left column shows how the particles evolve with respect to the frog's position, while the right column shows the different stages in more detail. The average state (position) of the frog in the current time-step is depicted by an orange circle overlaid on the particles in the left column of stage 3.

Blue flower, red thorns. Blue flower, red thorns. Blue flower, red thorns. Oh, this would be so much easier if I wasn't color-blind!
Donkey in Shrek

Chapter 3

Color-based tracking

One of the essential parts of visual tracking is the visual model of the target, which allows us to evaluate whether a target is present at a given location in the image. It might seem that we require a very detailed visual model to discriminate the tracked object from its surroundings. However, too detailed visual models are not appropriate when tracking nonrigid objects such as people. While they move, their appearance changes, and the visual model has to adapt to these changes, which is not easily achievable in a robust way. A different approach is to use a less detailed visual model, e.g., a color histogram, which can better account for the small changes in the target's appearance.
However, during tracking, the target's appearance will still change slowly, and a mechanism to adapt the reference visual model to these changes is still needed. Furthermore, the color-based tracking may degrade when the target is located on a cluttered background¹, since the measurements provided by the color model become too ambiguous. In this chapter we propose a color-histogram-based model of the target, which can use the information from the background to improve tracking performance, and can adapt to the temporal changes of the target's appearance.

The outline of this chapter is as follows. In Section 3.1, a histogram-based appearance model is presented and a new measure of presence which uses the background information is developed. In Section 3.2, the probabilistic model of the proposed measure of presence is derived. In Section 3.3 we propose a principle to harvest additional information from the background to mask out pixels that do not belong to the target. In Section 3.4 we propose and discuss a scheme for the adaptation of the visual model to the changes in the target's appearance. A color-based probabilistic tracker is presented in Section 3.5, and the results of the experiments with the tracker are presented in Section 3.6. The chapter is summarized in Section 3.7.

¹The term background clutter is used throughout this thesis to refer to the visually-similar background pixels near the tracked object. For example, when the target's texture is very similar to the texture of the background, we say that the background clutter is severe.

Figure 3.1: The images (a, b, c) show persons in different poses with their torsos approximated by an ellipse. Image (d) shows an ellipse approximating a palm.

3.1 Color histograms

Color histograms encode the color statistics of a given region containing the target by constructing a non-parametric model of the color distribution within that region.
As we have seen in the related work (Section 1.1.1), various models that encode the outline of the target have been proposed in the literature and could be used for encoding the region of interest. However, these models usually have to be built for a specific class of objects and do not generalize well to other objects which may be of different shapes. One drawback of the shape models is that they require the tracked objects to be sufficiently large in the images so that enough edge information can be obtained. Another drawback is that they are prone to failure when the input images are noisy and the tracked object changes its shape rapidly. We thus consider an ellipse as a less detailed model for encoding the region containing the object, which can approximate fairly well various orientations of the human body, as well as body parts such as palms; see, for example, Figure 3.1. In our application, the state of the tracked object $x_k$ is thus parameterized by an ellipse $x_k = (x_k, y_k, a_k, b_k)$ with its center at $(x_k, y_k)$, and with parameters $a_k$ and $b_k$ denoting its width and height, respectively.

When constructing the color histogram, it is beneficial to assign higher weights to the pixels that are closer to the center of the ellipse and lower weights to those farther away. This can help achieve some robustness in the appearance model, since the pixels that are closer to the center are less likely to be affected by the clutter stemming from the background pixels or the nearby objects. Furthermore, if some a-priori knowledge of which pixels are not likely to belong to the target is available, it should be used in the construction of the histogram, i.e., those pixels should be ignored. An elegant way of discarding the pixels that do not belong to the target is to use a mask function, which assigns a prior weight of zero to those pixels that are not likely to have been generated by the target and a weight of one to all the other pixels.
Let $E$ be the elliptical region corresponding to some state $x = (x, y, a, b)$. The RGB color histogram with $B = 8 \times 8 \times 8$ bins, $h_x = \{h_i\}_{i=1}^{B}$, sampled within the ellipse $E$, is then defined as

$h_i = f_h \sum_{u \in E} K(u)\,M(u)\,\delta_i(b(u))$,  (3.1)

where $u = (x, y)$ denotes a pixel within the elliptical region $E$, $\delta_i(\cdot)$ is the Kronecker delta function positioned at histogram bin $i$, and $b(u) \in \{1, \dots, B\}$ denotes the histogram bin index associated with the color of the pixel at location $u$. $K(\cdot)$ is an Epanechnikov weighting kernel [165], as in [35, 118], positioned at the center of the ellipse, $M(u)$ is the a-priori binary mask function, and $f_h$ is a normalizing constant such that $\sum_{i=1}^{B} h_i = 1$. For an illustration of the weighting kernel, the mask function and the principle of sampling a histogram, see Figure 3.2.

3.1.1 Color-based measure of presence

To localize an object during tracking, we require a measure of presence which provides a belief that an object with some reference histogram $h_k$ is located at a given state. When some additional information about the color of the background is available, it should be used in the evaluation of this belief. Indeed, a body of literature exists on modelling the background using more or less complicated statistical models and on how to use these to discern the target from the background. However, to be more general, we assume that only a simple model is available, e.g., an image of the background without objects.

Figure 3.2: The target's state is modelled as an elliptical region (a). The weighting kernel and its parameters are sketched in (b). The mask function which masks out the background pixels is shown in (c). The target histogram is sampled by considering only pixels which are visible through the mask function and assigning a weight to each pixel according to the weighting kernel (d).
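The histogram sampling of (3.1) can be sketched directly. The synthetic image, the all-ones mask, the linear bin-index convention and the unnormalized kernel below are illustrative assumptions of this sketch; only the 8x8x8 bin layout and the Epanechnikov weighting follow the text.

```python
import numpy as np

def sample_histogram(img, mask, cx, cy, a, b, bins=8):
    """Masked, kernel-weighted RGB histogram (3.1) inside an axis-aligned ellipse."""
    h = np.zeros(bins ** 3)
    H, W, _ = img.shape
    for v in range(H):
        for u in range(W):
            d2 = ((u - cx) / a) ** 2 + ((v - cy) / b) ** 2
            if d2 > 1.0 or mask[v, u] == 0:   # outside the ellipse or masked out
                continue
            k = 1.0 - d2                      # Epanechnikov kernel, up to scale
            r, g, bl = (img[v, u] // (256 // bins)).astype(int)
            h[(r * bins + g) * bins + bl] += k   # b(u): linear bin index
    s = h.sum()
    return h / s if s > 0 else h              # f_h normalization

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(40, 40, 3))  # synthetic stand-in image
mask = np.ones((40, 40), dtype=int)           # M(u) = 1 everywhere here
hist = sample_histogram(img, mask, cx=20, cy=20, a=10, b=8)
```

A real mask would zero out background-like pixels before sampling, so they contribute nothing to the histogram.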
We thus define the measure of presence, which evaluates whether a target with a predefined reference histogram $h_k$ is present at some state $x_k$, as

$D(h_A, h_k; h_B) = \beta^{-1}\,\rho(h_A, h_k; h_B)$,  (3.2)

where $h_A$ and $h_B$ are histograms sampled at the state $x_k$ on the current image and on the precalculated background image, respectively. $\beta$ is the ratio between the number of pixels within the elliptical region of $x_k$ that are assigned to the foreground by the mask function $M(u)$ and the number of those assigned to the background. $\rho(h_A, h_k; h_B)$ is the normalized distance between $h_A$ and $h_k$, given the background histogram $h_B$, defined as

$\rho(h_A, h_k; h_B) = \frac{d(h_A, h_k)}{\sqrt{d(h_B, h_k)^2 + d(h_A, h_k)^2}}$,  (3.3)

where $d(h_1, h_2) = (1 - \sum_{i=1}^{B}\sqrt{h_{1i}\,h_{2i}})^{1/2}$ is the Hellinger distance [11, 137, 154]. Note that the Hellinger distance is related to the well-known Bhattacharyya coefficient [74]; it is a metric and has a geometrical interpretation. For a detailed discussion of this distance please see [11, 154].

Figure 3.3: The histogram of a basketball player was sampled within the ellipse (a). The non-normalized distance $d(h_A, h_k)$ calculated at different positions around the player is shown in (b). The result for the proposed normalized distance measure $\rho(h_A, h_k; h_B)$ is shown in (c). For better visualization, one minus the distance measures is shown. The correct position of the player is depicted by a white arrow and a circle in each image. Notice how the mode corresponding to the selected player is more pronounced with respect to the background clutter when the normalized distance is used (c).

The normalization term in (3.3) incorporates the distance between the reference color model and the background color. Such a normalization aids tracking when the target's color is similar to the background. In these situations the measure (3.3) favors those regions for which the reference color model is closer to the color in the current image than to the background color.
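The Hellinger distance and the background-normalized distance (3.3) built from it can be sketched as follows; the helper names and the toy histograms are illustrative.

```python
import numpy as np

def hellinger(h1, h2):
    # d(h1, h2) = sqrt(1 - sum_i sqrt(h1_i * h2_i)) for normalized histograms;
    # the max() guards against tiny negative values from floating-point error.
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(h1 * h2))))

def normalized_distance(h_A, h_k, h_B):
    # rho of (3.3): the candidate-to-reference distance, normalized by how far
    # the background is from the reference model.
    d_A = hellinger(h_A, h_k)
    d_B = hellinger(h_B, h_k)
    return d_A / np.sqrt(d_A ** 2 + d_B ** 2)

h_ref = np.array([0.5, 0.5, 0.0, 0.0])    # toy reference model
h_bg = np.array([0.0, 0.0, 0.5, 0.5])     # toy background histogram
exact_match = normalized_distance(h_ref, h_ref, h_bg)
bg_like = normalized_distance(h_bg, h_ref, h_bg)
```

When the candidate matches the reference exactly the measure is 0, and when the candidate looks exactly like the background it rises to $1/\sqrt{2}$, which is how the normalization suppresses background-colored regions.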
In practice, when using a particle filter, this effectively attenuates the background clutter and forces particles closer to the target. An example of the normalized and non-normalized distance measures is shown in Fig. 3.3.

3.2 The likelihood function

To carry out the update step of the particle filter (Algorithm 2.4), we require the probability density function (pdf) of the presence measure (3.2). Due to the lack of a rigorous physical background from which the analytical form of this pdf could be derived, an experimental approach was chosen instead. The pdf of (3.2) was estimated from a large number of examples of moving persons; see Fig. 3.4a for examples. Some of these persons were tracked using a simple tracker from the literature [118]. In cases when the simple tracker failed, we resorted to manual marking. This enabled us to obtain approximately 115,000 values of the measure (3.2), which are visualized by the histogram in Fig. 3.4b.

Figure 3.4: The left-hand image shows examples of persons which were used to estimate the empirical probability density function of the measure (3.2). The right-hand image shows this function in the form of a histogram, with the maximum-likelihood fitted gamma probability density function overlaid.

To identify the best model for the gathered data, model selection was carried out using the Akaike information criterion (AIC) [3] among four models of the probability density function: exponential, gamma, inverse gamma and zero-mean Gaussian². The test with the AIC showed that the gamma function explained the data significantly better than the other functions. For this reason the probability density function of the measure (3.2) was chosen in the form of

p(y_k | x_k) ∝ D(h_A, h_k; h_B)^{γ₁−1} e^{−D(h_A, h_k; h_B)/γ₂}.   (3.4)

The parameters γ₁ and γ₂ were estimated from the data using the maximum-likelihood approach. The estimated values were γ₁ = 1.769 and γ₂ = 0.066.
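Evaluating the gamma-shaped likelihood (3.4) up to the proportionality constant is a one-liner; the sketch below (hypothetical helper name `color_likelihood`) uses the reported maximum-likelihood parameters. The normalization constant is irrelevant for particle weighting, since the weights are renormalized anyway.

```python
import numpy as np

# Maximum-likelihood parameters reported for (3.4)
GAMMA_1, GAMMA_2 = 1.769, 0.066

def color_likelihood(d):
    """Unnormalized gamma-shaped likelihood of the presence measure d = D(.)."""
    d = np.asarray(d, dtype=float)
    return d ** (GAMMA_1 - 1.0) * np.exp(-d / GAMMA_2)
```

Since γ₁ > 1, the mode lies at (γ₁ − 1)γ₂ ≈ 0.05 rather than at zero, which reflects the observation below that a moving nonrigid target is more likely to change its appearance slightly between frames than to keep it identical.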
For more details on the model-selection results, please see Appendix A.

²Only the main results of the model selection using the Akaike information criterion are reported here; the reader is referred to Appendix A for more details.

Note that the gamma distribution assigns small probability values to those values of the measure (3.2) that are very close to zero. At first glance this may not seem reasonable for the purposes of object localization; however, if we observe a moving nonrigid object such as a person in two consecutive time-steps, it is more likely that the person's appearance will change within these two time-steps than stay the same. This is an inherent property of the visual dynamics of nonrigid objects and is implicitly captured by the likelihood function (3.4).

3.3 The background mask function

While color histograms are powerful color cues for tracking textured objects, they can fail when the object is moving on a similarly textured background. This is usually due to their inability to capture the spatial relations in the texture, and to the fact that they are always sub-sampled in order to increase their robustness. There is, however, still some useful information left in the current and the background image — the difference between the two. By thresholding this difference image with an appropriate threshold κ_k, we can construct a mask image that filters out those pixels which are likely to belong to the background. Since, in general, the illumination of the observed scene is non-uniform in space and time, the threshold has to be estimated dynamically for the tracked object. We base our method for mask generation on the following observation. When an object described by an ellipse is located on a visually similar background, some small portion of the pixels within the ellipse either come from the background, or come from the object but are very similar to the background.
We thus assume that in those situations we can assign some percentage η₀ of the pixels within the object's ellipse to the background. Let x_k denote the estimated state of the target at time-step k, and let h_A and h_B be the histograms sampled at that state on the current image A(·) and the background image B(·), respectively. The current mask function is defined as

M_D(u) = 1  if ||A(u) − B(u)|| > κ_k ;  0 otherwise,   (3.5)

where u is a pixel, || · || is the L2 norm and κ_k is the threshold in the current time-step.

Note that we have to generate the mask function only when the tracked object is similar enough to the background. Thus, in practice, we verify after each tracking iteration the similarity between the target's visual model and the background. If this similarity is within a predefined bound (g(h_A, h_B) < g_thresh), then the mask is generated in the next time-step. The threshold κ_{k+1} for the next time-step is estimated as the threshold that would, in the current time-step, produce a mask function such that a predefined percentage η₀ of the pixels within the ellipse of the current state x_k would be assigned to the background. Otherwise, if g(h_A, h_B) ≥ g_thresh, the mask is not generated in the next time-step and M_D(u) = 1 in (3.5) for all pixels u.

3.3.1 Implementation of dynamic threshold estimation

The procedure for calculating the appropriate κ_{k+1} is as follows. Let d_E = {d_{iE}}_{i=1}^T be the set of pixel-wise intensity differences between the current image A(·) and the background image B(·), calculated within the ellipse E of the estimated state x_k, and let these differences be ordered in ascending order. We define the cumulative function corresponding to the ordered differences as

c_{lE} = (1/T) Σ_{i=1}^{l} 1 = l/T,   (3.6)

i.e., the fraction of the ordered differences not larger than d_{lE}. The smallest l at which c_{(l+1)E} exceeds η₀ then gives the required threshold κ_{k+1} = κ_k(η₀), or formally

κ_k(η₀) = d_{lE} ;  c_{lE} ≤ η₀ < c_{(l+1)E}.
(3.7)

The parameters η₀ and g_thresh were estimated empirically by manually selecting persons as they moved on a heavily cluttered background. We observed that, when a person is located on a cluttered background, usually at least 25 percent of the pixels within the ellipse describing that person can be assigned to the background, and thus we chose η₀ = 25%. The parameter which decides when to generate the mask was set in a similarly empirical manner, to g_thresh = 0.8. The procedure of threshold approximation is illustrated by Example 2.

Example 2. Consider a situation where we observe a yellow person on a yellow background (Figure 3.5a). Assume that at time-step k − 1 we know the state of that person, described by the ellipse E (Figure 3.5a), and that the background image is available (Figure 3.5b). We want to determine a threshold κ_k(η₀) such that 25 percent of the pixels within the ellipse will be masked out by the resulting mask function. We thus first calculate the ordered set of intensity differences between the current image and the background within the ellipse E, in the form of a histogram of differences (Figure 3.5c). The corresponding cumulative function is shown in Figure 3.5d. From the cumulative function we read off the difference value at which the function exceeds η₀ = 0.25, which gives κ_k(0.25) = 55. For illustration, the mask function calculated from the current image and the background image is shown in Figure 3.5e, and the masked image of the person is shown in Figure 3.5f.

3.4 Adaptation of the visual model

When a nonrigid object such as a human body moves through an observed scene, its texture varies due to the non-uniform lighting conditions, influences of the background, and variations of the person's pose. Therefore, the color model, i.e., the person's current reference histogram h_k, has to be able to adapt to these changes.
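Before turning to the adaptation, note that the quantile-based threshold estimation of Section 3.3.1 (illustrated by Example 2) can be sketched in a few lines. This is a simplified stand-in: reading the threshold off the cumulative function of the ordered differences is equivalent to taking the η₀-quantile of the absolute differences, and `estimate_threshold` is a hypothetical helper name.

```python
import numpy as np

def estimate_threshold(curr_patch, bg_patch, eta0=0.25):
    """Pick the difference threshold kappa so that a fraction eta0 of the
    pixels in the region has |A(u) - B(u)| below it, i.e., is assigned to
    the background by the mask function (3.5)."""
    d = np.sort(np.abs(curr_patch.astype(float)
                       - bg_patch.astype(float)).ravel())
    l = int(np.floor(eta0 * d.size))   # index where the cumulative
                                       # fraction reaches eta0
    return d[l]
```

In the tracker this threshold is computed at the estimated state in the current frame and used to generate the mask in the next one.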
In addition, if the current state of the tracked person is likely to have been falsely estimated, so that the corresponding ellipse does not fully contain the person, then the reference histogram should be updated by a very small amount, or not at all. Otherwise, it should be adapted by some larger amount. Let x_k be the estimated state of a person at the current time-step, and let the histograms h_A and h_B be sampled at that state on the current and the background image, respectively. The adaptation then follows a simple auto-regressive form

h_k = α_k h_A + (1 − α_k) h_{k−1},   (3.8)

where h_{k−1} is the reference histogram from the previous time-step. The intensity of the adaptation is defined with respect to the normalized distance (3.3) between h_A and h_{k−1} as

α_k = α_max (1.0 − ρ(h_A, h_{k−1}; h_B)),   (3.9)

where α_max denotes the maximal adaptation. In all experiments in this thesis we use α_max = 0.05. This means that, at each time-step, we allow at most 5% adaptation of the reference histogram. The following example was designed to provide some insight into the choice of the maximal adaptation value. For better illustration we will assume that the parameter α_k is constant and equal to the maximal adaptation, i.e., α_k = 0.05.

Figure 3.5: Example of estimating the threshold κ_k at η₀ = 0.25. A person denoted by an ellipse is shown in (a) and the background image is shown in (b). The histogram of intensity differences is shown in (c) and the resulting cumulative function is shown in (d). The threshold corresponding to η₀ = 0.25 is found at d_{iE} = 55. The resulting mask function is shown in (e), and (f) shows the mask function superimposed over (a).

Example 3. Assume we observe a controlled environment with a single object illuminated by a white light.
At time-step k = 0 we record its reference histogram h_0 and then turn on colored lights such that the apparent color of the object changes. The new color histogram of the object is h_c and remains constant for all k > 0. The model h_k begins to adapt to the new histogram h_c according to (3.8), h_k = (1 − α_k) h_{k−1} + α_k h_c. Since we assume that α_k and h_c are constant, we can rewrite the reference histogram at time-step k as

h_k = (1 − α_k)^k h_0 + (1 − (1 − α_k)^k) h_c.   (3.10)

Now say we want to know what portion of the new histogram h_c has been incorporated into the current reference histogram h_k after the 25th time-step³ at a constant adaptation value α_k = 0.05. Using (3.10) we have h_25 = (1 − 0.05)²⁵ h_0 + (1 − (1 − 0.05)²⁵) h_c ≈ 0.28 h_0 + 0.72 h_c, which means that after 25 time-steps the reference histogram contains approximately one quarter of the initial reference histogram h_0 and three quarters of the new histogram h_c.

Further insight into the influence of different values of α_k is provided in Figure 3.6a. The graph shows the percentage of the histogram h_c in h_k after the 25th time-step for different values of α_k. We see that by setting α_k = 0.2, the reference histogram becomes practically equal to the current histogram after the 25th time-step. This means that at α_k = 0.2, and at a frame rate of 25 frames per second, the reference histogram completely adapts to an instant change within the time-span of one second. Figure 3.6b shows the percentage of the new histogram in h_k at a constant adaptation α_k = 0.05 with respect to time. We see that the reference histogram would completely adapt to an instant change of the person's appearance within 120 time-steps, which, at a frame rate of 25 frames per second, amounts to about 5 seconds.

³Note that, in most of the experiments reported in this thesis, 25 frames correspond to one second of video.
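The selective update (3.8)–(3.9) and the closed form (3.10) used in Example 3 can be sketched together; `adapt_histogram` and `new_histogram_fraction` are hypothetical helper names, and the distance helpers repeat the definitions from Section 3.1.1 so the sketch is self-contained.

```python
import numpy as np

def hellinger(h1, h2):
    bc = np.sum(np.sqrt(h1 * h2))            # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))

def normalized_distance(h_A, h_k, h_B):
    g_A, g_B = hellinger(h_A, h_k), hellinger(h_B, h_k)
    return g_A / np.sqrt(g_B ** 2 + g_A ** 2)

def adapt_histogram(h_prev, h_A, h_B, alpha_max=0.05):
    """Selective adaptation (3.8)-(3.9): the update strength shrinks as the
    sampled histogram h_A moves away from the reference h_prev."""
    alpha = alpha_max * (1.0 - normalized_distance(h_A, h_prev, h_B))
    return alpha * h_A + (1.0 - alpha) * h_prev

def new_histogram_fraction(alpha, k):
    """Closed form (3.10): fraction of the new appearance h_c absorbed into
    the reference histogram after k steps at constant adaptation alpha."""
    return 1.0 - (1.0 - alpha) ** k
```

Since (3.8) is a convex combination, the updated reference histogram remains normalized whenever its inputs are.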
Figure 3.6: Percentage of the new histogram in the reference histogram after 25 time-steps, with respect to a constant adaptation parameter α (a). Percentage of the new histogram in the reference histogram at α = 0.05, with respect to the number of time-steps (b).

3.5 Color-based probabilistic tracker

In the previous sections we presented an adaptive color-histogram-based visual model of the target, which is able to harvest information about the color of the background with the aim of better discriminating the target from the background. In this section we show how the proposed color model can be used for tracking within the framework of particle filters.

Prior to tracking, we first calculate the background image. This can be achieved, for example, either by taking a single image of an empty scene, or by recording a sequence of images and then constructing a median image pixel-wise along the temporal component. The tracker is initialized by selecting the target and recording the reference histogram; this can be done, for example, by manually clicking on the target. At each tracking iteration the following steps are then executed. First, a mask function is calculated using the dynamically estimated threshold from the previous time-step, as discussed in Section 3.3. Then an iteration of the bootstrap particle filter (Algorithm 2.4) is executed using the likelihood function derived in Section 3.2. The current estimate x_k of the target state (position and size of the ellipse) is calculated as the MMSE estimate (2.32). A histogram is then sampled at the estimated state x_k and used to adapt the reference histogram according to Section 3.4.
Finally, in the last step of the tracking iteration, the threshold for generating the mask image in the next time-step is calculated, if necessary, according to Section 3.3. This procedure is summarized in Algorithm 3.1.

Initialize:
• Calculate the background image, e.g., pixel-wise by means of a median filter along the temporal axis.
• Initialize the tracker by selecting the target (e.g., manually).
Tracking: for each time-step k,
• Calculate the mask function according to Section 3.3.
• Execute an iteration of the bootstrap particle filter (Algorithm 2.4) using the likelihood function from Section 3.2.
• Estimate the current state x_k by the MMSE estimate (2.32) from the particle filter.
• Sample the histogram at x_k and adapt the model to that histogram as in Section 3.4.
• If required, estimate the threshold for the mask function M_D(u) in the next time-step (Section 3.3).

Algorithm 3.1: Color-based probabilistic tracker.

Note that the proposed color model and the probabilistic tracker in Algorithm 3.1 require setting some parameters: the parameters of the likelihood function, of the dynamic background subtraction, and of the adaptation of the reference histogram. In the previous sections we have discussed how these parameters were selected, and in all the following experiments in this thesis their values are kept constant. For a better overview, we summarize them in Table 3.1.

Table 3.1: Parameters of the color-based probabilistic tracker.
• Parameters of the color-based likelihood function (3.4): (γ₁, γ₂) = (1.769, 0.066)
• Parameters for dynamic background subtraction (Section 3.3.1): (η₀, g_thresh) = (0.25, 0.8)
• Maximal adaptation of the color model (3.9): α_max = 0.05

3.6 Experiments

A set of experiments was performed to evaluate how the proposed likelihood function (Sect. 3.2), the adaptation scheme (Sect. 3.4) and the dynamic background subtraction (Sect. 3.3) influence the performance of tracking persons in images.
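To make the per-iteration structure of Algorithm 3.1 concrete before turning to the experiments, here is a deliberately simplified bootstrap particle-filter iteration over a one-dimensional state. It is a generic sketch, not the thesis tracker: the state is a scalar rather than an ellipse, a random walk stands in for the NCV dynamic model, and `bootstrap_iteration` and `systematic_resample` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

def systematic_resample(particles, weights):
    """Systematic resampling: draw N particles proportionally to weights."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[np.clip(idx, 0, n - 1)]

def bootstrap_iteration(particles, likelihood, process_noise=1.0):
    """One bootstrap-filter iteration: predict, weight, estimate, resample."""
    # predict: diffuse particles (random-walk stand-in for the NCV model)
    particles = particles + rng.normal(0.0, process_noise,
                                       size=particles.shape)
    # update: weight each particle by the likelihood of the observation
    weights = likelihood(particles)
    weights = weights / weights.sum()
    # MMSE estimate: the posterior mean, cf. (2.32)
    estimate = np.sum(weights * particles)
    return systematic_resample(particles, weights), estimate
```

In the actual tracker the likelihood argument would be the color-based likelihood (3.4) evaluated at each particle's ellipse, with the mask and adaptation steps wrapped around this core.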
In these experiments we compared the proposed color-based tracker from Algorithm 3.1, denoted by Tcol, to a reference tracker Tref from the literature [118]. The reference tracker Tref was also a color-histogram-based particle filter; however, it did not account for the background in the likelihood function and in the adaptation, and did not use the dynamic background subtraction scheme. Both trackers used 50 particles in the particle filter. The target motion was modelled by two independent nearly-constant-velocity (NCV) models in the horizontal and vertical direction. The dynamics of the target's size was modelled by two independent random-walk (RW) models in the horizontal and vertical direction. For a detailed description of the NCV and RW models, and the assumptions they impose on the dynamics, see Appendix B.1 and Appendix B.2, respectively. The parameters of the NCV models were set for each set of experiments separately, according to the expected size of the tracked object in the video. The parameters of the RW models were set such that the target's size would change by at most approximately 15 percent between consecutive time-steps.

Three experiments were considered for comparing the performance of Tcol and Tref. The first experiment considered tracking a person in a heavily cluttered environment from a bird's-eye view. The second experiment considered tracking a person from a side view in a less cluttered environment, in a situation where that person is occluded by another, visually similar, person. The third experiment considered tracking a person from a side-view moving camera. In each experiment, a single person was manually initialized and tracked throughout the sequence. If the person was lost during tracking, the tracker was manually reinitialized and tracking continued. In all experiments, the background image was calculated prior to tracking.
This was done by calculating the mean value of each pixel along its temporal axis over the entire sequence of images.

Experiment 1: Tracking in severe clutter

The first experiment considered tracking a 12 × 12 pixels large goal-keeper on a heavily cluttered background in a 773-image sequence of a handball match (Figure 3.7a). On average, the tracker Tcol required only one user intervention during tracking, while the tracker Tref required approximately 14 interventions to maintain a successful track throughout the sequence. Note that the goal-keeper was visually very similar to the goal area; nevertheless, Tcol was still able to maintain a good track. The only time that Tcol failed was when the goal-keeper was located in front of the goal-pole (Figure 3.7c). At that point he was simply too similar to the background and the tracker drifted to his legs, which was considered a loss of track.

Figure 3.7: A blue goalkeeper on a heavily cluttered background (a) is depicted by a white arrow. The background image is shown in (b). The single situation where Tcol failed by drifting to the goal-keeper's legs is shown in (c); the white ellipse depicts the falsely estimated position.

Experiment 2: Tracking with occlusion

The second experiment was conducted on a 600-image sequence taken from the PETS 2006 [127] surveillance scenario (see Figure 3.8). This experiment considered tracking a person dressed in black, who walked from the left part of the scene to the right, stopped, waited there for a while, and then walked back to the left. At the end of the sequence, the person was occluded once by another, visually similar, person walking in the opposite direction (Figure 3.9a, middle column).
On average, the reference tracker Tref required approximately 5 user interventions during tracking when the person was standing still on a dark background (e.g., Figure 3.8b), and another intervention when the person was occluded by the other person; this situation is shown in Figure 3.9b. On the other hand, the proposed tracker Tcol was able to track the person throughout almost the entire sequence. However, when the person got occluded by another person of similar color, the measurements became too ambiguous and tracking failed (see, for example, Figure 3.9c).

Figure 3.8: Examples of the surveillance video used in Experiment 2. In images (a,b) the tracked person is depicted by a white arrow. Image (c) shows the approximation of the background image, which was used in the proposed tracker Tcol.

Experiment 3: Tracking from a moving camera

The third experiment considered tracking the upper body of a handball player in a 700-image sequence, which was recorded with a moving camera (see, for example, Figure 3.10). The tracked player was dressed in yellow and was thus visually similar to some parts of the court. Figure 3.10c shows the approximation to the background image that was used by the proposed tracker Tcol. Note that, since the camera was not static, this image is not the actual background image. However, it still captures some salient visual properties, such as the color of the court (lower part of Figure 3.10c) and the color of the tribune (upper part of Figure 3.10c).

Figure 3.9: Images show frames 627, 638 and 664 from the surveillance video used in Experiment 2, in which a person is occluded by another, visually similar person (a). Results of tracking using the reference tracker Tref and the proposed tracker Tcol are shown in (b) and (c), respectively, with the estimated locations of persons depicted by white ellipses.
The proposed tracker Tcol was able to use the information provided by the approximation of the background to avoid losing the player and drifting to the floor. Figure 3.11a shows such a situation, where the tracked player was moving partly on the yellow part of the court. While Tcol was able to keep track of the player (Figure 3.11c), Tref failed (Figure 3.11b). In addition to the moving camera, the tracked player was on several occasions occluded by other, differently colored, players. In most cases this caused a failure of Tref, while, again, Tcol was still able to track the player through the occlusion. One such situation is shown in Figure 3.12. On average, the tracker Tref required approximately 8 interventions to maintain a successful track. On the other hand, Tcol comfortably tracked the player throughout the entire recording.

Figure 3.10: Examples of the video recording used for tracking from a moving camera (Experiment 3). Images (a,b) show the yellow player (depicted by a white arrow) which was tracked. Image (c) shows the approximation of the background image used by the proposed tracker.

3.7 Conclusion

In this chapter we have proposed several improvements of the color-based tracker. The first was the color-based measure of the target's presence (Section 3.1.1), which uses information from the approximation of the background image to reduce the influence of background clutter. Using a model-selection methodology and maximum-likelihood estimation, we have proposed a likelihood function (Section 3.2) which can be used to probabilistically interpret values of the proposed presence measure. However, in cases when the target is moving over those parts of the background that are very similar in color to the target, the proposed measure of presence may not be discriminative enough.
For that reason we have considered background subtraction, i.e., generating a mask function with the aim of masking out pixels in the current image that do not belong to the target. In situations where the lighting of the scene is changing, or the camera is moving or shaking, it is usually difficult to obtain an accurate model of the background. We have therefore considered using only a simple approximation to the background and proposed a dynamic background subtraction (Section 3.3), which is our second improvement of the color-based tracker. The mask function is generated by evaluating the similarity between the tracked target and the background model and is thus specialized to the tracked target. The third improvement was the selective adaptation of the target's visual model (Section 3.4), which guards against updating the color-based visual model in situations where the position of the target is falsely estimated, or when the target is occluded by another object. We have also shown how these improvements are probabilistically combined within the framework of particle filters into a color-based probabilistic tracker.

Figure 3.11: Images show frames 347, 356, 364 and 378 from a video in which the yellow player moves on a yellow background (a). The reference tracker drifted to the floor and tracking failed (b), while the proposed tracker still maintained a successful track (c). The estimated player's location is depicted in each image by a white ellipse.

Experiments were conducted with tracking people from different views using static and moving cameras. The proposed tracker was compared to a reference tracker that was conceptually similar to our own, but did not utilize the proposed improvements.
The results showed that the proposed tracker yielded more stable tracking, requiring significantly fewer user interventions than the reference tracker in situations when the target was moving on a heavily cluttered background. The experiment of tracking a person from a moving camera showed that the proposed tracker outperformed the reference tracker even though the background image could not be accurately estimated. While the proposed tracker was able to cope with short-term occlusions when the tracked person was occluded by a differently colored person, its performance degraded in situations where the occluding person was visually similar. In the next chapter we propose a solution which can help resolve such situations.

Figure 3.12: Images show frames 589, 618, 633 and 645 from a video where the yellow player is occluded by other, differently colored, players. During the occlusion the reference tracker failed to track the yellow player (b), while the proposed tracker maintained a successful track (c). The tracker-estimated player's location is depicted in each image by a white ellipse.

We have gotten stuck half-way in our transition from the planned and command economy to a normal market economy. We have created... a hybrid of the two systems.
Boris N. Yeltsin (1931 - 2007)

Chapter 4

Combined visual model

The color-based probabilistic visual model proposed in the previous chapter can significantly increase tracking performance in cases when the color of the background is similar to the color of the tracked object. However, an inherent drawback of color-based (or, for that matter, edge-based) models is that tracking may fail whenever the target gets into close proximity of a visually similar object. We observed one such situation in the second experiment of the previous chapter, where we were tracking a person dressed in black through an occlusion by another person dressed in black.
In that situation, the visual information provided by the color-based visual model became too ambiguous, and after the occlusion was over, the tracker continued to track the wrong person. In applications such as video surveillance, visual human-computer interfaces and tracking in sports, the camera is sometimes positioned such that the scene is viewed only from the side. In these situations, complete occlusions between visually similar objects are quite frequent, which is bound to deteriorate the performance of any color-based tracker.

In this chapter we argue that ambiguities which arise from the visual similarity between different objects can be resolved, to some extent, if we account not only for the target's color but also for a low-level temporal component of the visual information – the optical flow. We illustrate this argument in Figure 4.1 with an example of tracking a person's right hand after it has been occluded by the left hand. Figure 4.1d shows the likelihood function corresponding to the reference color model of the hand. Observe that the mode of the likelihood function stretches over both hands. This causes a persistent ambiguity in the hand's position, and it is very likely that tracking will fail. Now assume that we calculate the optical flow in the image (Figure 4.1b) and that we know the tracked hand is moving to the left. If we visualize the likelihood function reflecting which flow vectors in Figure 4.1b point to the left, we obtain the local-motion likelihood function in Figure 4.1e. Note that, while one of the modes of this function corresponds to the tracked hand, the other hand is hidden, since it is not moving to the left. The local-motion likelihood function also assigns significant likelihood to some other parts of the image which do not correspond to the tracked hand, and so on its own it too introduces some ambiguity in the position of the tracked hand.
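The complementary nature of the two cues can be checked with a toy numeric sketch: two one-dimensional likelihoods that are individually ambiguous, but whose product is not. All numbers below are invented purely for illustration.

```python
import numpy as np

# Toy 1D likelihoods over candidate positions 0..9. The color likelihood
# has modes at both hands (positions 2 and 7); the local-motion likelihood
# supports only regions moving to the left (here around position 7, plus
# some clutter elsewhere).
color_lik  = np.array([.05, .1, .9, .1, .05, .05, .1, .9, .1, .05])
motion_lik = np.array([.05, .05, .05, .05, .3, .2, .6, .8, .3, .1])

# Combining the two cues by a simple product: each cue suppresses the
# other's spurious modes, leaving a single mode at the tracked hand.
combined = color_lik * motion_lik
```

The probabilistic justification for using a product, namely the conditional independence of color and motion given the state, is developed later in this chapter.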
But if we combine (multiply) the color likelihood with the local-motion likelihood, we get the combined likelihood function shown in Figure 4.1f. Note that now only the mode corresponding to the tracked hand remains, and that the maximum corresponds to the hand's location (Figure 4.1c) – we have thus successfully resolved the ambiguity of the hand's position.

The outline of this chapter is as follows. In Section 4.1 we discuss a method for estimating the optical flow. Based on this method we define the local-motion feature in Section 4.2, where we also derive a probabilistic model for the local motion. The combined, color- and motion-based probabilistic tracker is derived in Section 4.3, and results of experiments are reported in Section 4.4. We summarize this chapter in Section 4.5.

4.1 Optical flow

When an object moves through an observed environment, its motion is perceived by the camera as changes in the intensity of points in 3D space. If we assign a velocity vector to each of those 3D points, we obtain the so-called motion field, which tells us how each point in the scene translates with respect to the camera (see [84], pages 278–285). An image is generated in the camera through a projection of the light rays of the 3D points onto the CCD sensor. While one source of changes in the intensity of the 3D points is the motion field, the other is the changing illumination of the scene. These two sources induce temporal patterns of intensities at the pixels on the CCD sensor. The induced patterns are called the apparent motion, and an estimate of the apparent motion calculated from a sequence of images is called the optical flow.

A widely used hypothesis for calculating the optical flow is that the brightness of a local structure (e.g., a pixel on a moving object) remains constant over

Figure 4.1: A person's hand (depicted by the ellipse) is tracked after being occluded by the other hand (a).
The local motion, the optical flow, induced by both hands is depicted in (b). The color likelihood and the local-motion likelihood are shown in (d) and (e), respectively. Bright colors correspond to high likelihood, while dark colors correspond to low likelihood. Image (f) shows the combined, color-and-motion, likelihood from (d) and (e). The position corresponding to the maximum likelihood in (f) is depicted by a white cross in (c).

consecutive images, while its location may change [16]. This is known as the data conservation constraint. However, since calculation of the flow vector from a single pixel is ill-posed, additional constraints are required. Alternatively, one can assume data conservation within a patch rather than at a single pixel and then solve for the optical-flow vectors by least squares [103]; this is the basis of the well-known Lucas-Kanade method. However, since a nonrobust least-squares estimator is used in the Lucas-Kanade method, the resulting optical flow is still poorly estimated in regions of homogeneous intensity, where the calculation is again ill-posed and very susceptible to noise. This drawback can be remedied to some extent by noting that the intensity patterns of nearby pixels in the image tend to move similarly. The constraint which assumes similar velocities in neighboring pixels is called the spatial coherence constraint. Horn and Schunck proposed a method for calculating the optical flow [58] which simultaneously applies the data conservation constraint and the spatial coherence constraint. This method produces a better optical flow than the Lucas-Kanade method, but at the cost of a significantly increased computation time.
In practice, however, both the data conservation and the spatial coherence constraints are violated in the presence of multiple motions, and since the underlying calculations are usually carried out using least squares, the estimated optical flows may still contain errors. Some researchers [16] therefore replace the nonrobust least squares with robust estimators to handle multiple motions, or explicitly model the discontinuities by generative models [17]. Usually, calculations of the flow vectors with respect to the various constraints involve iterative procedures and are thus computationally very demanding. For that reason, Zach et al. [173] have recently proposed efficient implementations that exploit features of specialized graphics cards to calculate the optical flow with the spatial coherence and data conservation constraints in real time. To illustrate how the constraints influence the calculation of the optical flow, we compare in Figure 4.2 the optical flows estimated using the Lucas-Kanade approach [103] and Black's [16] multiple-motion method. Note that while Black's method (Figure 4.2c) produces a much better optical flow than Lucas-Kanade (Figure 4.2b), the time for calculating Black's flow in Figure 4.2 was considerably longer than for the Lucas-Kanade flow. On a Celeron 1.5 GHz, 500 MB RAM laptop, a C++ implementation of Black's [18] method required twelve seconds to calculate the flows of a 288 × 360 pixel image pair, while Lucas-Kanade [122] required only one second.

Figure 4.2: Comparison between the optical flow calculated using the Lucas-Kanade algorithm and using M. J. Black's multiple-motion algorithm. The reference image is shown in (a), while the results of the Lucas-Kanade and Black's methods are shown in (b) and (c), respectively.
Due to the considerable computation time of the methods that involve multiple constraints, in the following we will focus on simple methods like the Lucas-Kanade algorithm to obtain the information about the local motions in the image. Thus, rather than calculating the dense optical flow (i.e., optical flow vectors at each pixel) using a time-consuming method, we will identify which points in the image contain enough local texture and calculate the flow vectors only at those points. We will thus be interested in calculating a sparse rather than dense flow, which will allow us to use computationally less demanding algorithms for flow calculation.

4.1.1 Calculating the sparse optical flow

There are two conceptual ways in which the optical flow at a given location in the image can be defined. One is to consider from which location in the previous image the pixel in the current image has been translated. The other is to consider to which location in the next image the pixel in the current image will be translated. In our implementations we consider the first definition of the optical flow. Thus the optical flow at a given point is calculated by estimating the optical flow vector from the current image back to the previous image and reversing its direction. This approach was chosen since it relates the current

Figure 4.3: Example of the optical flow calculation. Images (a) and (b) are consecutive images from a video. To calculate the flow in image (b), we first calculate the optical flow (c) at each point of the current image (b) back to the previous image (a). The direction of this flow is then reversed. The resulting flow overlaid over the current image (b) is shown in (d). For better visibility, only every fifth flow vector is shown.

image to the previous image and is in better agreement with the way in which we will use the optical flow for tracking.
An example of the described definition of the optical flow is shown in Figure 4.3. In our implementation, the optical flow is estimated using the pyramidal implementation [22] of the well-known Lucas-Kanade method. The pyramidal implementation starts by constructing a multi-resolution pyramid for a given pair of images, where at each level of the pyramid the images are resized to half of their size at the previous level. The optical flow vectors are initially estimated using the Lucas-Kanade method at the coarsest (lowest) level and used to initialize the calculation of the flow vectors at the next higher level. In this way, at each level, the flow-vector estimate obtained from the previous level is refined. This allows for efficient calculation of the flow vectors in the presence of small as well as large motions. As already discussed, the Lucas-Kanade calculation of the flow vector at a given pixel does not take into account the neighboring flow vectors, and the resulting flow field is usually noisy (see, e.g., Figure 4.2b and Figure 4.3d). Worse yet, this method fails to provide a reliable estimate of the flow vectors in regions with poor local texture. We therefore apply the Shi-Tomasi feature detector [143] to determine locations with sufficient local texture and calculate the optical flow only at those locations. The Shi-Tomasi feature at location (x, y) is defined by the smallest eigenvalue of the covariance matrix of the gray-scale intensity gradients, which are calculated in the neighborhood of (x, y). The location (x, y) is accepted as a valid Shi-Tomasi feature if the smallest eigenvalue exceeds a predefined threshold λ_th. An example of valid Shi-Tomasi features and the corresponding flow vectors is shown in Figure 4.4. Note that the majority of the flow vectors that correspond to the valid features in Figure 4.4e appear to be well estimated.
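As a concrete illustration of the two building blocks just described, the following minimal NumPy sketch computes the Shi-Tomasi response at a point and a single-window Lucas-Kanade flow vector using the backward-then-reverse definition adopted above. It is a single-level, single-point sketch (the actual implementation uses the pyramidal method of [22]); the function names and window sizes are ours.

```python
import numpy as np

def shi_tomasi_response(I, x, y, half=1):
    """Smallest eigenvalue of the covariance matrix of the grey-scale
    intensity gradients in a (2*half+1)^2 neighbourhood of (x, y)."""
    Iy, Ix = np.gradient(I)                       # derivatives along rows/cols
    gx = Ix[y - half:y + half + 1, x - half:x + half + 1].ravel()
    gy = Iy[y - half:y + half + 1, x - half:x + half + 1].ravel()
    M = np.array([[gx @ gx, gx @ gy],
                  [gx @ gy, gy @ gy]])
    return np.linalg.eigvalsh(M)[0]               # eigenvalues in ascending order

def lucas_kanade_flow(I_prev, I_cur, x, y, half=4):
    """Single-window Lucas-Kanade step implementing the definition from the
    text: estimate the displacement from the current image back to the
    previous one by least squares, then reverse its direction."""
    Iy, Ix = np.gradient(I_prev)
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    It = (I_cur - I_prev)[win].ravel()
    b, *_ = np.linalg.lstsq(A, It, rcond=None)    # backward displacement
    return -b                                     # forward flow [u_x, u_y]
```

On a synthetic Gaussian blob translated by one pixel, the recovered flow at a sufficiently textured point is close to (1, 0), while the Shi-Tomasi response separates textured from flat locations.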
Figure 4.4: Two consecutive images from a video of a squash match are shown in (a) and (b). The optical flow estimated for image (b) using the Lucas-Kanade method is shown in (c). The valid Shi-Tomasi features are depicted in white in image (d). The flow vectors from (c) that correspond to the valid Shi-Tomasi features in (d) are shown in (e). For clarity, only every third flow vector is shown.

Using the sparse optical flow defined above, we can now derive the local-motion feature, which will later be used to provide the motion information for a probabilistic tracker.

4.2 Optical-flow-based local-motion feature

Let v_k(x, y) = [r, φ] be the optical flow vector at location (x, y) in the current image, with amplitude r and orientation φ. Note that in the literature the optical flow vectors are usually written in Cartesian coordinates. The reason why we use the polar notation instead is that in the following we treat the angle of the optical flow separately from its amplitude. The local-motion feature v_E = [r_E, φ_E] of a region E is then encoded as the weighted average of the flow vectors

v_E = f_v^{-1} \sum_{(x,y) \in E'} v_k(x, y) K(x, y),   (4.1)

where E' ⊆ E is the set of detected Shi-Tomasi features within region E, K(x, y) is the Epanechnikov kernel [140] used to assign higher weights to those flow vectors that are closer to the center of E, and f_v = \sum_{(x,y) \in E'} K(x, y) is a normalization term such that f_v^{-1} \sum_{(x,y) \in E'} K(x, y) = 1. To avoid the pathological situations associated with vectors of zero amplitude, the summation in (4.1) is carried out in Cartesian coordinates.

4.2.1 Local-motion likelihood

Let v_ref = [r_ref, φ_ref] be a reference vector which models the target's local motion, and let v_E be a local-motion vector calculated within region E.
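Before turning to the likelihood, the feature-extraction step (4.1) can be sketched as follows. The flows are passed in Cartesian coordinates, as the text prescribes, and converted to polar form only at the end; the kernel profile and function names are illustrative assumptions of this sketch.

```python
import numpy as np

def epanechnikov(x, y, cx, cy, rx, ry):
    # Kernel profile max(0, 1 - d^2), with d the distance from the region
    # centre normalised by the region's half-axes (constant factor omitted).
    d2 = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2
    return np.maximum(0.0, 1.0 - d2)

def local_motion_feature(points, flows, cx, cy, rx, ry):
    """Eq. (4.1): kernel-weighted average of the sparse flow vectors of the
    valid features in the region, summed in Cartesian coordinates and
    returned in the polar form (r_E, phi_E)."""
    w = np.array([epanechnikov(x, y, cx, cy, rx, ry) for x, y in points])
    if w.sum() == 0:          # no valid Shi-Tomasi features: v_E is undefined
        return None
    v = (w[:, None] * np.asarray(flows, dtype=float)).sum(axis=0) / w.sum()
    return np.hypot(v[0], v[1]), np.arctan2(v[1], v[0])
```

Averaging in Cartesian coordinates avoids the pathological wrap-around of angles near zero-amplitude vectors, exactly as noted in the text.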
We define the angular and amplitude similarity measures G_φ and G_r, respectively, between v_ref and v_E as

G_φ(v_E, v_ref) = { ∠(v_E, v_ref)/π ; r_E > δ_th ∧ r_ref > δ_th
                  { 1 ; otherwise,   (4.2)

G_r(v_E, v_ref) = { |r_E − r_ref|/(r_E + r_ref) ; r_E > δ_th ∨ r_ref > δ_th
                  { 0 ; otherwise,   (4.3)

such that G_φ(·,·) ∈ [0, 1] and G_r(·,·) ∈ [0, 1], where ∠(·,·) denotes the angle between two vectors, |·| is the L1 norm, and δ_th is a threshold below which the vectors are considered as having zero amplitude. If region E contains no valid Shi-Tomasi features, the vector v_E is undefined and the similarities are set to G_φ = 1.0 and G_r = 1.0. We have observed in a preliminary study [88] that the probability density functions (pdf) of (4.2) and (4.3) can be well approximated by exponential distributions. However, in practice we approximate the current reference motion using the motions observed in previous time-steps. This may impair the quality of tracking whenever the target suddenly and significantly changes its motion. To cope with such events, we introduce a uniform component into the probability density function. The joint probability density function of (4.2) and (4.3) with parameters θ = [λ_φ, λ_r, w_noise] is then defined as

p(G_φ, G_r | θ) ∝ (1 − w_noise) e^{−(G_φ/λ_φ + G_r/λ_r)} + w_noise,   (4.4)

where λ_φ and λ_r are the parameters of the exponential distributions and 0 < w_noise < 1 is the weight of the uniform component.

4.2.2 Adaptation of the local-motion feature

After each tracking iteration, the current state x̂_k of the target and its current velocity v̂_k are calculated, e.g., via the MMSE estimate (2.32) from the particle filter. The new region E containing the target is determined and the local-motion vector v_{E_k} = [r_{E_k}, φ_{E_k}] (4.1) is estimated. If the region E contains at least one valid Shi-Tomasi feature, then v_{E_k} is used to adapt the reference local-motion model v_ref = [r_ref, φ_ref].
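A sketch of the measures (4.2, 4.3) and the likelihood (4.4) follows. The normalised absolute difference used for the amplitude term is an assumption of this sketch (recovered from the requirement that G_r ∈ [0, 1]), as is the threshold value; vectors are given in the polar form (r, φ).

```python
import numpy as np

DELTA_TH = 1e-2   # amplitude below which a vector counts as zero (assumed)

def g_phi(vE, vref):
    """Angular measure (4.2): angle between the vectors, scaled to [0, 1]."""
    (rE, phiE), (rr, phir) = vE, vref
    if rE > DELTA_TH and rr > DELTA_TH:
        d = abs(phiE - phir) % (2 * np.pi)
        return min(d, 2 * np.pi - d) / np.pi
    return 1.0

def g_r(vE, vref):
    """Amplitude measure (4.3); the normalised absolute difference is an
    assumption of this sketch."""
    (rE, _), (rr, _) = vE, vref
    if rE > DELTA_TH or rr > DELTA_TH:
        return abs(rE - rr) / (rE + rr)
    return 0.0

def motion_likelihood(gp, gr, lam_phi=0.1, lam_r=0.3, w_noise=0.01):
    """Eq. (4.4): exponential model with a uniform component that keeps a
    floor probability for abrupt motion changes (unnormalised)."""
    return (1.0 - w_noise) * np.exp(-(gp / lam_phi + gr / lam_r)) + w_noise
```

With the parameters from Section 4.4, two aligned vectors receive a much higher likelihood than two opposing ones, while the uniform component prevents the likelihood from ever vanishing.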
This is achieved by applying an autoregressive scheme

φ_ref^+ = (1 − β_{φk}) φ_ref^− + β_{φk} φ_{E_k},  r_ref^+ = (1 − β_{rk}) r_ref^− + β_{rk} r_{E_k},   (4.5)

where the superscripts (·)^− and (·)^+ denote the reference model before and after the adaptation, respectively. The variables β_{φk} and β_{rk} are the current adaptation intensities

β_{φk} ∝ p(G_φ(v̂_k, v_{E_k}), 0 | θ),  β_{rk} ∝ p(0, G_r(v̂_k, v_{E_k}) | θ),   (4.6)

where p(·,·|θ) is defined in (4.4), and β_{φk} ∈ [0, 1], β_{rk} ∈ [0, 1]. If the region E does not contain any valid Shi-Tomasi features, then v_{E_k} is undefined and the reference is not adapted. From (4.6) it follows that the reference local-motion model is adapted to the local changes in the target's motion only when the velocity with which the tracker predicts the target is moving is approximately in agreement with the observed local motion at the currently estimated state. Otherwise the adaptation is low, since the target is probably being occluded by another object.

4.3 The combined probabilistic visual model

We derive the combined color/local-motion-based visual model by extending the color-based visual model from Chapter 3 to account for the local motion. Under the assumption that the target's color properties are independent of its motion, the likelihood function for the particle filter can be written as

p(y_k | x_k) = p(y_k^{col} | x_k) p(y_k^{mot} | x_k),   (4.7)

where p(y_k^{col} | x_k) is the color likelihood at state x_k and p(y_k^{mot} | x_k) is the local-motion likelihood at that state. Note that, in the case of the purely color-based visual model from Chapter 3, T_col, the likelihood function is equal to p(y_k^{col} | x_k). The combined color/local-motion-based visual model, which we denote by T_cmb, is then obtained by replacing the likelihood function in T_col by (4.7) and setting

p(y_k^{mot} | x_k) = p(G_φ(v_{x_k}, v_ref), G_r(v_{x_k}, v_ref) | θ).   (4.8)

In the equation above, p(·,·|θ) is defined in (4.4), v_{x_k} is the local motion (4.1) sampled at state x_k, and v_ref is the reference local motion.
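The adaptation step (4.5, 4.6) and the likelihood factorization (4.7) can be sketched as below. This sketch assumes that the adaptation intensities weight the newly observed motion, so that strong agreement between the predicted and observed motion yields strong adaptation, as the verbal description implies, and that β is normalised by the likelihood's maximum (which is 1 at zero dissimilarity).

```python
import numpy as np

def adaptation_intensity(g, lam, w_noise=0.01):
    """Eq. (4.6): intensity proportional to the likelihood (4.4) of the
    observed dissimilarity g; the likelihood peaks at 1 for g = 0, so the
    result already lies in (0, 1]."""
    return (1.0 - w_noise) * np.exp(-g / lam) + w_noise

def adapt_reference(v_ref, v_E, beta_r, beta_phi):
    """Eq. (4.5): autoregressive blend of the reference motion (r, phi)
    with the observed local motion.  Assumption of this sketch: the
    intensities weight the observed motion, so agreement -> adaptation."""
    r_ref, phi_ref = v_ref
    r_E, phi_E = v_E
    return ((1.0 - beta_r) * r_ref + beta_r * r_E,
            (1.0 - beta_phi) * phi_ref + beta_phi * phi_E)

def combined_likelihood(p_col, p_mot):
    """Eq. (4.7): color and motion are treated as conditionally
    independent, so the combined likelihood is simply their product."""
    return p_col * p_mot
```

When the predicted and observed motions agree (dissimilarity near zero), the intensity approaches one and the reference snaps to the observation; during an occlusion the intensity drops toward the uniform floor and the reference is essentially frozen.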
While during tracking the color histograms are sampled within the elliptical regions of the hypothesized states x_k, we have found that in practice it is sufficient to sample the local-motion feature (4.1) within rectangular regions superimposed over the ellipses. The combined color/local-motion-based probabilistic tracker can be derived from the purely color-based tracker (Algorithm 3.1) by using the likelihood function defined in (4.7, 4.8) and the local-motion adaptation scheme from Section 4.2.2. The combined probabilistic tracker is summarized in Algorithm 4.1.

4.4 Experiments

Several experiments were conducted using examples from surveillance, sports tracking and hand tracking (Figure 4.5) to compare the proposed combined tracker T_cmb from Algorithm 4.1 to the purely color-based tracker T_col proposed in Algorithm 3.1. Both trackers used 50 particles in the particle filter and a nearly-constant-velocity dynamic model. All recordings were taken at a framerate of 25 frames per second, except for the recording used for hand tracking, which was taken at 30 frames per second.

Initialize:
• Initialize the tracker by selecting the target (e.g., manually).
Tracking:
• For k = 1, 2, 3, ...
1. Execute an iteration of the color-based particle filter from Algorithm 3.1 using the likelihood function p(y_k|x_k) defined in (4.7) and the current reference local motion v_ref.
2. Estimate the current MMSE state x̂_k and the current velocity v̂_k using (2.32).
3. Estimate the new reference v_ref according to Section 4.2.2.
Algorithm 4.1: The combined color/local-motion-based probabilistic tracker.

The Shi-Tomasi feature detection from Section 4.1.1 was performed using 3 × 3 pixel neighborhoods, and only features whose smallest eigenvalue exceeded λ_th = 10^{-3} were accepted. The size of the integration window in the Lucas-Kanade optical flow calculation was set to 9 × 9 pixels.
The amplitude threshold used in (4.2) and (4.3) was set to δ_th = 10^{-2} pixels. In the experiment with hand tracking, a two-level pyramid was used to calculate the optical flow. The pyramids were not used in the other experiments. The parameters of the local-motion likelihood function (4.4) were set experimentally to λ_φ = 0.1, λ_r = 0.3 and w_noise = 0.01. Note that, since λ_r was chosen to be larger than λ_φ, the amplitude of the local motion had a smaller impact on the value of the likelihood function than the angle. The reasoning behind this is that during accelerated movement, typical for hands and people, the amplitude of the optical flow changes more significantly than its direction. Note that, with the exception of the number of pyramid levels, all parameters of the combined visual model were kept fixed for all experiments. We summarize these parameters in Table 4.1.

Figure 4.5: Images from the recordings used in the experiments with surveillance (a, b), sports tracking (c) and hand tracking (d).

Table 4.1: Parameters of the combined-visual-model probabilistic tracker.
Shi-Tomasi features (Section 4.1.1)
• Feature size: 3 × 3 pixels
• Threshold on smallest eigenvalue: λ_th = 10^{-3}
Lucas-Kanade optical flow (Section 4.1.1)
• Integration window size: 9 × 9 pixels
Local-motion likelihood function (Section 4.2.1)
• Minimal amplitude of local motion (4.2, 4.3): δ_th = 10^{-2} pixels
• Likelihood parameters (4.4): λ_φ = 0.1, λ_r = 0.3, w_noise = 10^{-3}

Experiment 1: Tracking in dynamic clutter

The aim of the first experiment was to demonstrate how the proposed combined visual model can resolve situations in which a person moves in dynamic clutter. A recording from the PETS 2006 database [127] (Figure 4.5a) was chosen, which contained a person dressed in black walking in front of a group of persons also dressed in black. This group constituted the so-called dynamic clutter, since this clutter did not come from a static background.
The size of the person in the video was approximately 13 × 30 pixels. The person was manually selected and tracked with T_col and T_cmb. The results of tracking with T_col are shown in Figure 4.6a. Even though the person was walking in front of the group, the purely color-based tracker T_col could not discern the person from the group when they came into contact, due to their visual similarity, and tracking failed. Note that it is difficult even for a human observer to discern the tracked person from the others by looking solely at, for example, the 155th frame in Figure 4.6a. On the other hand, the proposed combined visual model in T_cmb was able to make use of the optical-flow information and successfully tracked the person throughout the contact (Figure 4.6b).

Figure 4.6: Images from the recording used in the experiment of tracking in dynamic clutter. The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

Experiment 2: Tracking a person through occlusion

In the second experiment we reconsidered the surveillance scenario (Figure 4.5b) which was used in the previous chapter to evaluate the performance of the color-based probabilistic tracker; see Section 3.6. In that scenario, a person was tracked as he moved from the left part of the scene to the right, and back to the left. While he was walking back to the left, he was briefly occluded by another, visually similar person.
Recall that the tracker which used the color-based visual model proposed in Chapter 3 was able to track that person even when the person was located against a cluttered background. However, when the person was occluded by another person, the tracker failed since the measurements became too ambiguous. The results of tracking with the proposed combined tracker T_cmb and the purely color-based tracker T_col are shown in Figure 4.7. Figure 4.7a shows that T_col fails during the occlusion, which is due to the significant visual ambiguity of the person's position. On the other hand, T_cmb is able to make use of the local motion and does not fail (Figure 4.7b). In particular, the local motion indicates that the apparent motion of the tracked target should point to the left. Since the tracked person and the occluding person are moving in opposite directions, this helps resolve the visual ambiguity and allows the tracker to maintain a lock on the correct target.

Experiment 3: Tracking a person through multiple occlusions

To demonstrate how the proposed tracker with the combined model performs in the presence of multiple occlusions between visually similar persons who rapidly change their direction of movement, we considered an example of tracking a player in a squash match (Figure 4.5c). The tracked player was approximately 25 × 45 pixels large and was occluded 14 times by another, visually similar player. The player was tracked five times, and the average number of tracking failures was recorded. The purely color-based tracker T_col failed on average twelve times, while T_cmb failed on average three times. Figure 4.8a shows five frames from the recording where, after the occluded player appears (t = 418), the visual information becomes ambiguous, since both players wear white shirts, and T_col fails (t = 425).
On the other hand, T_cmb successfully utilizes the local-motion information to resolve this ambiguity and tracks the correct player even after the occlusion (Figure 4.8b).

Figure 4.7: Images from the recording used in the experiment of tracking through an occlusion. The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

Experiment 4: Tracking a person's palms

In the fourth experiment we considered a recording of a person waving his hands (Figure 4.5d) in front of the camera. The hands were approximately 20 × 20 pixels large and were tracked independently of each other. They occluded each other 17 times, with the majority of occlusions occurring in front of the person's face. The reference tracker T_col failed 26 times, by either following the wrong hand or the face after the occlusion. The combined tracker T_cmb resolved the majority of occlusions and failed only four times, by losing the hand and locking onto the person's face. A detailed inspection of the results showed that, in the situations where T_cmb failed, the target's color model strongly favored the face, while the local-motion feature at the edge of the tracked hand still supported the target's reference motion. The reference motion model deteriorated, which caused the tracker to drift from the hand to the face. Figure 4.9 shows an example where T_col lost the hand after it was occluded by the other hand (Figure 4.9a), while T_cmb resolved the occlusion (Figure 4.9b).
Figure 4.8: Images from the recording used in the experiment of tracking a person through multiple occlusions. The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

4.5 Conclusion

In this chapter we have proposed a combined visual model for probabilistic tracking which is composed of two visual models of the tracked object. The first is the color-based model which we proposed in Chapter 3. Since this model alone cannot resolve situations in which the tracked object comes into close proximity of another, visually similar object, we have proposed a novel visual model, which we call the local-motion model. The local-motion model was presented in Section 4.2, where we showed how it can be calculated from the optical flow (Section 4.1.1), which is estimated using the Lucas-Kanade algorithm. While the Lucas-Kanade algorithm is relatively fast, it gives poor estimates of the optical flow in regions which lack texture. For that reason, we use the Shi-Tomasi features to detect regions with enough texture and estimate the optical flow only in those regions. Thus the local motion is defined using only a sparse optical flow. To account for the errors in the optical-flow estimation and rapid changes in the target's motion, we derived a probabilistic model of the

Figure 4.9: Images from the recording used in the experiment of tracking a person's hands.
The upper row (a) shows the results of tracking with the purely color-based tracker T_col, while the results for the proposed T_cmb are shown in the bottom row (b). In all images, the current estimated position of the person is depicted by the white ellipse, while the estimated velocity is depicted by the cyan line. In the lower-row images (b), the red line depicts the local-motion feature calculated at the estimated position, while the green line shows the current model of the local motion.

local motion in Section 4.2.1. Since the local-motion model changes significantly during the target's movement, an adaptation scheme for the local-motion model was devised in Section 4.2.2. In Section 4.3 we showed how the proposed local-motion model can be probabilistically combined with the color-based model into the combined probabilistic visual model. We also proposed a particle-filter-based tracker which uses the combined model. Several experiments were carried out to demonstrate the improvement in tracking performance when using the combined model instead of the purely color-based model. The experiments included tracking persons in dynamic clutter, tracking through occlusions, and an example of tracking a person's hands through multiple occlusions. The experiments have clearly shown the superiority of the combined model over the purely color-based model by significantly reducing the number of failures during tracking. Since the proposed combined tracking scheme can help resolve ambiguities associated with multiple visually similar targets, it can be used as an extension to existing multiple-target tracking schemes, such as [141], to increase their robustness.
Note also that the local-motion-based feature (Section 4.2) is general enough to be used not only within the framework of particle filters, but also with non-stochastic methods, e.g., the recently proposed AdaBoost tracker [54], to resolve the ambiguities caused by the visual similarity between different objects.

And yet it moves.
Galileo Galilei (1564–1642)

Chapter 5
A two-stage dynamic model

In the previous two chapters we explored two different visual models of the target to improve tracking in the presence of poor visual information. In this chapter we will focus on another important part of the probabilistic tracker – the dynamic model, by which we describe the dynamics of the tracked target. If the dynamics are well modelled, then the tracker's performance can be improved in several ways. One improvement is more accurate estimates of the target's position and prediction; another is a smaller failure rate. Note that since particle filters are Monte Carlo methods, the accuracy of the target's estimates depends on the number of particles used in the filter. Larger numbers of particles allow denser coverage of the target's state space and usually result in more accurate tracking. However, using more particles means more evaluations of the likelihood model. Depending on the complexity of the target's visual model, these evaluations can be time-consuming and can significantly slow down the tracking. By choosing an appropriate dynamic model, the particles can be used more efficiently. This can be achieved by directing them into the regions of the state space which are more likely to contain the current state of the target. A smaller number of particles may therefore be used to achieve equal or even better accuracy than a poorer dynamic model would achieve with a larger number of particles.
In this sense, at a fixed accuracy, a proper dynamic model can effectively reduce the computation time required for a single tracking iteration, making tracking feasible for real-time applications. When tracking persons, one of the following two models is used in practice: the random-walk (RW) model or the nearly-constant-velocity (NCV) model. The simpler of the two, the RW model, assumes that the velocity can be modelled by a white-noise process and is appropriate when the target performs radical accelerations in different directions, for example, when a person suddenly changes the direction of its motion or suddenly stops. On the other hand, in cases when the target moves in a certain direction, the RW model usually performs poorly and the motion is much better described by the NCV model. Unlike the RW model, the NCV model assumes that the acceleration can be modelled by a white-noise process. While it is true that at times human motion can be best explained by either the RW or the NCV model, the motion is often somewhere in between the two. In this chapter we propose a two-stage dynamic model which is able to better model the dynamics of human motion. We call the model a "two-stage" dynamic model due to its particular structure, which is composed of two different models: a liberal and a conservative model. The liberal model allows larger perturbations in the target's dynamics and is able to account for motions in between RW and NCV. On the other hand, the conservative model assumes smaller perturbations and is used to further adapt the liberal model to the current dynamics of the tracked object. We also propose a two-stage probabilistic tracker based on the particle filter and apply it to tracking entire persons as well as a person's hands. The outline of this chapter is as follows. In Section 5.1 we develop the liberal dynamic model and analyze how the parameters of the model influence the model's structure.
The conservative model is proposed in Section 5.2, and in Section 5.3 we propose the two-stage dynamic model and the corresponding probabilistic tracker. In Section 5.4 the results of the experiments are reported, and we summarize the chapter in Section 5.5.

5.1 The liberal dynamic model

As noted, the RW model assumes that the target's velocity is a white-noise sequence and is thus temporally completely uncorrelated. On the other hand, the NCV model assumes that the velocity is temporally strongly correlated, since changes in velocity arise only due to the white noise of the acceleration. Thus the conceptual difference between the RW and NCV models is that they assume two extremal views on the temporal correlation of the velocity. With this rationale we can arrive at a more general model by simply treating the velocity as a correlated noise, but without deciding on the extent to which it is correlated. A convenient way to model the correlated noise is to use a Gauss-Markov process (GMP). The GMP has been previously used with some success (see, e.g., [146, 147, 176]) in modelling the acceleration of an airplane in flight, which allowed improved tracking during air maneuvers. In this section we show that by modelling the velocity with a Gauss-Markov process we arrive at a model of which RW and NCV are only special cases and which is able to account for more general dynamics; we will call this model the liberal model. We also provide an analysis of the parameters of the liberal model. We start by noting that changes in the position x(t) arise due to a non-zero velocity v(t) of the target, i.e., ẋ(t) = v(t). The velocity v(t) is then modelled as a non-zero-mean correlated noise

v(t) = ṽ(t) + v̄(t),   (5.1)

where ṽ(t) denotes a zero-mean correlated noise and v̄(t) is the current mean of the noise – the input velocity.
We model the correlated noise ṽ(t) as a Gauss-Markov process with the autocorrelation function R_ṽ(τ) = σ² e^{−β|τ|}, where σ² is the variance of the process noise and β is the correlation time constant. To derive the dynamic model of the process (5.1) in a form which we can use for tracking, we first have to find a stochastic differential equation (s.d.e.) of the process (5.1), governed by a white-noise process, and then find its discretized counterpart. To derive the s.d.e. of the process ṽ(t) in (5.1), with the correlation function R_ṽ(τ), we have to find a shaping filter (see, e.g., [26], page 137) with a system transfer function G(s)¹ which transforms a unity white noise u(t) into ṽ(t) (see Figure 5.1). In principle, the transfer function G(s) can easily be found from the spectral densities of the input u(t) and the output ṽ(t) of the system shown in Figure 5.1. Although the derivation of the shaping filter can usually be found in textbooks on the theory of estimation and control, we provide a summarized derivation here for completeness. We start by noting that, for a stationary process, the Wiener-Khinchine relation (see, e.g., [26], page 86) tells us that the spectral density S_ṽ(s) of the process ṽ(t) is given by the Fourier transform F[·] of the process autocorrelation function; therefore we have

S_ṽ(s) = F[R_ṽ(τ)].   (5.2)

¹We use s to denote the complex frequency jω, where ω has the usual meaning of the angular frequency.

Figure 5.1: A shaping filter G(s) that takes a unity white-noise signal u(t) and transforms (shapes) it into ṽ(t).

It can also be shown (see, e.g., [26], page 130) that the spectral density S_u(s) of the input u(t) and the spectral density S_ṽ(s) of the output ṽ(t) of the system in Figure 5.1 are related as

S_ṽ(s) = G(s)G(−s)S_u(s).   (5.3)

Since the spectral density of the unity white noise is unity, i.e., S_u(s) = 1, we have from (5.2) and (5.3)

S_ṽ(s) = G(s)G(−s) · 1 = F[R_ṽ(τ)],   (5.4)

which gives the system transfer function, i.e.
the shaping filter,

$$G(s) = \frac{\sqrt{2\beta\sigma^2}}{s + \beta}. \quad (5.5)$$

A stochastic differential equation that corresponds to the shaping filter (5.5) is exactly the s.d.e. which we seek and is given as

$$\dot{\tilde{v}}(t) = -\beta\tilde{v}(t) + \sqrt{q_c}\,u(t), \quad (5.6)$$

where $q_c = 2\beta\sigma^2$ is the spectral density of the equivalent white-noise process acting on $\tilde{v}(t)$ and where, as before, $u(t)$ denotes a unit-variance white-noise process. The continuous-time s.d.e. of (5.1) can now be derived by expressing $\tilde{v}(t)$ from (5.1) and plugging it into (5.6),

$$\dot{\tilde{v}}(t) = -\beta\left(v(t) - \bar{v}(t)\right) + \sqrt{q_c}\,u(t). \quad (5.7)$$

In order to arrive at a discretized form of the above model, we first note from (5.1) that $\tilde{v}(t) = v(t) - \bar{v}(t)$ and assume that the input velocity $\bar{v}(t)$ remains constant over a sampling interval. Thus we obtain

$$\dot{v}(t) = -\beta v(t) + \beta\bar{v}(t) + \sqrt{q_c}\,u(t). \quad (5.8)$$

Since $\dot{x}(t) = v(t)$, we can write the complete system s.d.e. in the matrix form

$$\dot{\mathbf{X}}(t) = \begin{bmatrix} 0 & 1 \\ 0 & -\beta \end{bmatrix}\mathbf{X}(t) + \begin{bmatrix} 0 \\ \beta \end{bmatrix}\bar{v}(t) + \begin{bmatrix} 0 \\ \sqrt{q_c} \end{bmatrix}u(t), \quad (5.9)$$

where we have defined $\mathbf{X}(t) = [x(t), v(t)]^T$. The model (5.9) is a linear model with time-invariant system matrices, which makes the discretization of this system a straightforward matter (see Appendix B). The discretized counterpart of the continuous-time liberal model (5.1) with discretized states $\mathbf{X}_k = [x_k, v_k]^T$ is

$$\mathbf{X}_k = \Phi\mathbf{X}_{k-1} + \Gamma\bar{v}_{k-1} + \mathbf{W}_k, \quad (5.10)$$

$$\Phi = \begin{bmatrix} 1 & \frac{1}{\beta}\left(1 - e^{-\Delta t\beta}\right) \\ 0 & e^{-\Delta t\beta} \end{bmatrix}, \qquad \Gamma = \begin{bmatrix} \Delta t - \frac{1}{\beta}\left(1 - e^{-\Delta t\beta}\right) \\ 1 - e^{-\Delta t\beta} \end{bmatrix},$$

where $\bar{v}_{k-1}$ is the input velocity for the current time-step $k$, $\Delta t$ is the time-step length, and $\mathbf{W}_k$ is a white-noise sequence with a covariance matrix

$$Q = q_c\begin{bmatrix} q_{11} & q_{12} \\ q_{12} & q_{22} \end{bmatrix}, \quad (5.11)$$

$$q_{11} = \frac{1}{2\beta^3}\left(2\Delta t\beta - 3 + 4e^{-\Delta t\beta} - e^{-2\Delta t\beta}\right), \qquad q_{12} = \frac{1}{2\beta^2}\left(1 + e^{-2\Delta t\beta} - 2e^{-\Delta t\beta}\right), \qquad q_{22} = \frac{1}{2\beta}\left(1 - e^{-2\Delta t\beta}\right).$$

Note that there are two parameters which can be set in the liberal model (5.10, 5.11): one is the correlation-time parameter $\beta$ and the other is the spectral density $q_c$ of the noise. In the following we first give an analysis of how the parameter $\beta$ influences the structure of the proposed liberal model.
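The closed-form matrices in (5.10) and (5.11) translate directly into code. The sketch below (the function name and the parameter values are our own, for illustration) constructs $\Phi$, $\Gamma$ and $Q$ for given $\beta$, $q_c$ and $\Delta t$; a prediction step of a sampled-state tracker then amounts to one matrix-vector product plus Gaussian noise:

```python
import numpy as np

def liberal_model(beta, qc, dt):
    """Discretized liberal-model matrices (5.10, 5.11): Phi, Gamma and noise covariance Q.
    A direct transcription of the closed-form expressions; the helper is ours, not thesis code."""
    e1 = np.exp(-dt * beta)
    e2 = np.exp(-2.0 * dt * beta)
    Phi = np.array([[1.0, (1.0 - e1) / beta],
                    [0.0, e1]])
    Gamma = np.array([dt - (1.0 - e1) / beta,
                      1.0 - e1])
    q11 = (2.0 * dt * beta - 3.0 + 4.0 * e1 - e2) / (2.0 * beta**3)
    q12 = (1.0 + e2 - 2.0 * e1) / (2.0 * beta**2)
    q22 = (1.0 - e2) / (2.0 * beta)
    Q = qc * np.array([[q11, q12], [q12, q22]])
    return Phi, Gamma, Q

Phi, Gamma, Q = liberal_model(beta=2.0, qc=1.0, dt=1.0)
print(Phi)
print(Gamma)
print(Q)
```

For example, a single stochastic prediction of a state $\mathbf{X}_{k-1}$ with input velocity $\bar{v}_{k-1}$ would then be `Phi @ X + Gamma * vbar + rng.multivariate_normal(np.zeros(2), Q)`.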
Then we propose a method for selecting the spectral density $q_c$ for a given class of objects.

5.1.1 Parameter $\beta$

In terms of the parameter $\beta$, the dynamic properties of the liberal model (5.10) can be considered as being in between a random-walk and a nearly-constant-velocity model²; this can be seen by limiting $\beta$ to zero or to infinity. In the case of $\beta \to 0$, the model takes the form of a pure NCV model with a state transition matrix $\Phi_{\beta\to 0}$ and an input matrix $\Gamma_{\beta\to 0}$,

$$\Phi_{\beta\to 0} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \qquad \Gamma_{\beta\to 0} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \quad (5.12)$$

On the other hand, at $\beta \to \infty$, the model takes the form of a RW model with the state transition matrix $\Phi_{\beta\to\infty}$ and the input matrix $\Gamma_{\beta\to\infty}$,

$$\Phi_{\beta\to\infty} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \Gamma_{\beta\to\infty} = \begin{bmatrix} \Delta t \\ 1 \end{bmatrix}. \quad (5.13)$$

Note that the values of $\Gamma_{\beta\to\infty}$ are nonzero, thus the input velocity has to be set to zero, $\bar{v}_{k-1} = 0$, to obtain the pure random-walk model. For a comparison of the system and input matrices (5.12, 5.13) with those of the NCV and RW models, see Appendix B.1 and Appendix B.2.

²Recall that the conceptual difference between the RW and the NCV model is in the way they treat the time-wise correlation of the target's velocity. For completeness, we give derivations of the RW and NCV models in Appendix B.1 and Appendix B.2, respectively.

We have seen thus far that the liberal dynamic model takes the structure of the RW and NCV models at the limiting values of $\beta$. But what happens when $\beta$ is set somewhere in between zero and infinity? To get a better understanding of that, it is beneficial to rewrite the model in the following way. Let $\mathbf{x}_k = [x_k, v_k]^T$ denote the state at the time-step $k$ with position $x_k$ and velocity $v_k$, and, similarly, let $\mathbf{x}_{k-1} = [x_{k-1}, v_{k-1}]^T$ denote the state at the previous time-step $k-1$.
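The two limits (5.12) and (5.13) can also be verified numerically from the closed-form entries of $\Phi$ and $\Gamma$ in (5.10). A minimal sketch, where the helper name and the stand-in values $10^{-6}$ and $10^{6}$ for the limits are our own:

```python
import numpy as np

def phi_gamma(beta, dt=1.0):
    # System matrix Phi and input matrix Gamma of the liberal model (5.10).
    e1 = np.exp(-dt * beta)
    Phi = np.array([[1.0, (1.0 - e1) / beta], [0.0, e1]])
    Gamma = np.array([dt - (1.0 - e1) / beta, 1.0 - e1])
    return Phi, Gamma

# beta -> 0 approaches the NCV structure (5.12): Phi = [[1, dt], [0, 1]], Gamma = [0, 0].
print(phi_gamma(1e-6))
# beta -> infinity approaches the RW structure (5.13): Phi = [[1, 0], [0, 0]], Gamma = [dt, 1].
print(phi_gamma(1e6))
```

As the text remarks, the $\beta \to \infty$ limit leaves $\Gamma$ nonzero, so $\bar{v}_{k-1} = 0$ must be enforced to recover a pure random walk.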
We also rewrite the elements of the system transition matrix $\Phi$ and the input matrix $\Gamma$ (5.10) in the following abbreviated form

$$\Phi = \begin{bmatrix} 1 & \phi_{1,2} \\ 0 & \phi_{2,2} \end{bmatrix}, \qquad \Gamma = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}.$$

Note from (5.10) that $\Phi$ and $\Gamma$ depend on the size of the time-step $\Delta t$, which is the time between one measurement and the next. Without loss of generality we can set the time-step to unity, $\Delta t = 1$. For completeness, let us also define the values of the noise terms at time-step $k$, acting on the position and the velocity, by $\mathbf{W}_k = [w_{xk}, w_{vk}]^T$. Now we can rewrite the liberal model (5.10) in terms of the state's components as

$$x_k = x_{k-1} + \phi_{1,2}v_{k-1} + \gamma_1\bar{v}_{k-1} + w_{xk}, \qquad v_k = \phi_{2,2}v_{k-1} + \gamma_2\bar{v}_{k-1} + w_{vk}. \quad (5.14)$$

This means that $\phi_{1,2}$ and $\gamma_1$ are the proportions in which the internal velocity $v_{k-1}$ and the input velocity $\bar{v}_{k-1}$ will be combined into the deterministic part of the velocity acting on the current position $x_k$³. Similarly, $\phi_{2,2}$ and $\gamma_2$ are the proportions in which the internal velocity $v_{k-1}$ and the input velocity $\bar{v}_{k-1}$ will be combined into the deterministic part of the velocity acting on the current velocity $v_k$.

³The nondeterministic part of the velocity acting on the current position $x_k$ is the white noise $w_{xk}$.

With $\Delta t$ fixed, the values of the mixing factors $\phi_{1,2}$, $\phi_{2,2}$, $\gamma_1$ and $\gamma_2$ depend solely on $\beta$. We show this dependence in Figure 5.2.

Figure 5.2: In (a), the values of $\phi_{1,2}$ are depicted by the dashed line and the values of $\gamma_1$ by the full line. In (b), the values of $\phi_{2,2}$ are depicted by the dashed line and the values of $\gamma_2$ by the full line. In both images, the upright dash-dotted line marks the values of $\phi_{1,2}$, $\phi_{2,2}$, $\gamma_1$ and $\gamma_2$ at $\beta = 2$; for convenience, these values are written out at the marked locations.

From Figure 5.2 we see that by increasing $\beta$ the influence of the input velocity $\bar{v}_{k-1}$ in (5.14) increases, and for a very large $\beta$ the internal velocity $v_{k-1}$ is completely disregarded by the dynamic model, as $\phi_{1,2}$ and $\phi_{2,2}$ of (5.14) tend to zero. On the other hand, $\gamma_1$ and $\gamma_2$ tend to zero for small values of $\beta$.
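With $\Delta t = 1$ the mixing factors of (5.14) take simple closed forms, and each pair sums to one ($\gamma_1 = 1 - \phi_{1,2}$ and $\gamma_2 = 1 - \phi_{2,2}$), so the deterministic velocity contributions are convex combinations of the internal and input velocities. A small sketch (the helper name is ours) computes the factors as functions of $\beta$ and should correspond to the values written out in Figure 5.2:

```python
import numpy as np

def mixing_factors(beta, dt=1.0):
    """Mixing factors of the component-wise liberal model (5.14)."""
    e1 = np.exp(-dt * beta)
    phi12 = (1.0 - e1) / beta    # weight of the internal velocity v_{k-1} on the position x_k
    gamma1 = dt - phi12          # weight of the input velocity vbar_{k-1} on the position x_k
    phi22 = e1                   # weight of the internal velocity on the velocity v_k
    gamma2 = 1.0 - e1            # weight of the input velocity on the velocity v_k
    return phi12, gamma1, phi22, gamma2

# At beta = 2 (the value used in the experiments) this gives approximately
# phi12 = 0.432, gamma1 = 0.568, phi22 = 0.135, gamma2 = 0.865.
print(mixing_factors(2.0))
```

The convex-combination view makes the role of $\beta$ explicit: large $\beta$ shifts all the weight to the input velocity, small $\beta$ to the internal one.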
This means that we can consider $\beta$ as a parameter that specifies an a-priori confidence in the input velocity $\bar{v}_{k-1}$ and the internal velocity $v_{k-1}$. If, for example, we know that $\bar{v}_{k-1}$ is very accurate, then $\beta$ should be set to a very large value; otherwise, a smaller $\beta$ should be used. The two-stage dynamic model which is presented in this chapter usually yields reasonable estimates of the input velocity $\bar{v}_{k-1}$ for a large class of targets. In practice we have observed that it is thus beneficial to let the input velocity $\bar{v}_{k-1}$ have a dominant effect over $v_{k-1}$ in estimating the current velocity $v_k$. However, if we want the liberal model to be able to account for a greater agility of the target, it is also beneficial to let the internal velocity $v_{k-1}$ have a greater effect on predicting the current position $x_k$. We have found that these requirements are sufficiently well met at $\beta \approx 2$, which is the value we use in all subsequent experiments. The values of $\phi_{1,2}$ and $\gamma_1$ at $\beta = 2$ are shown in Figure 5.2a, while the values of $\phi_{2,2}$ and $\gamma_2$ at $\beta = 2$ are shown in Figure 5.2b.

5.1.2 Selecting the spectral density

Another important parameter of the liberal model (5.10) is the spectral density $q_c$ of the process noise (5.11). Note that in many cases it is possible to obtain some general characteristics of the dynamics of the class of objects which we want to track. Specifically, the expected squared distance $\sigma_m^2$ that objects of a certain class travel between two time-steps is often available. Assuming that we have some estimate of $\sigma_m^2$, and that the time-step size $\Delta t$ and the parameter $\beta$ are known, we now derive a rule of thumb for selecting the spectral density $q_c$. To derive the rule of thumb, let us consider the following example. Assume that at time-step $k-1$ a target is located at the origin of the coordinate system, i.e., $x_{k-1} = 0$, and begins moving with a velocity $v_{k-1} \sim \mathcal{N}(0, q_{22}q_c)$, i.e., $\mathbf{X}_{k-1} = [0, v_{k-1}]^T$.
Assuming that the input velocity $\bar{v}_{k-1}$ in (5.10) is zero, the target's state after a single time-step is

$$\mathbf{X}_k = \Phi\mathbf{X}_{k-1} + \mathbf{W}_k. \quad (5.15)$$

The covariance of the position at time-step $k$ is

$$P = \langle\mathbf{X}_k\mathbf{X}_k^T\rangle = \Phi\langle\mathbf{X}_{k-1}\mathbf{X}_{k-1}^T\rangle\Phi^T + Q.$$

For each experiment the performance difference

$$\Delta^{(i)} = C^{(i)}_{\mathrm{TS}} - C^{(i)}_{\mathrm{REF}} \quad (5.32)$$

was calculated. The term $C^{(i)}_{\mathrm{TS}}$ was the cost value (5.31) of $T_{\mathrm{TS}}$, while $C^{(i)}_{\mathrm{REF}}$ was the cost value of the reference tracker $T_{\mathrm{REF}}$. In our case the null hypothesis $H_0$ was that $T_{\mathrm{TS}}$ is not superior to $T_{\mathrm{REF}}$. For each tracker we calculated the sample-performance-difference mean

$$\bar{\Delta} = \frac{1}{R}\sum_{i=1}^{R}\Delta^{(i)} \quad (5.33)$$

and its standard error

$$\sigma_{\bar{\Delta}} = \sqrt{\frac{1}{R}\sum_{i=1}^{R}\left(\Delta^{(i)} - \bar{\Delta}\right)^2}. \quad (5.34)$$

The null hypothesis was then tested against an alternative hypothesis $H_1$, that $T_{\mathrm{TS}}$ is superior to the reference tracker $T_{\mathrm{REF}}$, using the statistic $\lambda = \bar{\Delta}/\sigma_{\bar{\Delta}}$. Usually, the alternative hypothesis is accepted at a significance level $\alpha$ if $\lambda > \mu_\alpha$, where $\mu_\alpha$ represents the point on the standard Gaussian distribution corresponding to the upper-tail probability $\alpha$. In our experiments we used $\alpha = 0.05$, which is common practice in hypothesis testing.

The results of the hypothesis testing on position and prediction with respect to a different number of particles in the particle filter are shown in Table 5.2 and Table 5.3. Table 5.2 shows the results for testing the hypothesis that $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{RW}}$, while Table 5.3 shows the results for testing the hypothesis that $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{NCV}}$. The second and third column in Table 5.2 and Table 5.3 show the test statistic $\lambda$. In all cases the test statistic is greater than $\mu_{0.05} = 1.645$. From Table 5.2 we can thus accept the hypothesis that $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{RW}}$ in estimating the position and the prediction at the $\alpha = 0.05$ level. Similarly, from Table 5.3 we can also accept the hypothesis that the tracker $T_{\mathrm{TS}}$ is superior to $T_{\mathrm{NCV}}$.

$$\Phi(\Delta t) = e^{F\Delta t} = I + F\Delta t + \frac{1}{2!}F^2\Delta t^2 + \frac{1}{3!}F^3\Delta t^3 + \cdots,$$

$$\Gamma u_{k-1} = \int_{t_{k-1}}^{t_k}\Phi(\tau)G\,u(\tau)\,d\tau, \qquad \mathbf{W}(t_{k-1}) = \int_{t_{k-1}}^{t_k}\Phi(\tau)L\,w(\tau)\,d\tau, \quad (B.3)$$

where $t_k = t_{k-1}$
$+\,\Delta t$, $\mathbf{W}(t_{k-1})$ is the driven response at $t_k$ due to the presence of the white-noise input during the $(t_{k-1}, t_k)$ interval, and is itself a white-noise sequence. From now on we will abbreviate notations of values at times $t_k$ by the subscripts $(\cdot)_k$. Thus (B.2) may be written as

$$\mathbf{x}_k = \Phi(\Delta t)\mathbf{x}_{k-1} + \Gamma u_{k-1} + \mathbf{W}_{k-1}. \quad (B.4)$$

To find the covariance matrix $Q_{k-1}$ of the white-noise sequence $\mathbf{W}_{k-1}$, we have to evaluate the following expectation

$$Q_{k-1} = \langle\mathbf{W}_{k-1}\mathbf{W}_{k-1}^T\rangle = \left\langle\left[\int_{t_{k-1}}^{t_k}\Phi(\xi)L\,w(\xi)\,d\xi\right]\left[\int_{t_{k-1}}^{t_k}\Phi(\eta)L\,w(\eta)\,d\eta\right]^T\right\rangle = \int_{t_{k-1}}^{t_k}\int_{t_{k-1}}^{t_k}\Phi(\xi)L\,\langle w(\xi)w^T(\eta)\rangle\,L^T\Phi(\eta)^T\,d\xi\,d\eta. \quad (B.5)$$

Since the continuous input disturbance $w(t)$ is assumed to be a white-noise random process with a zero mean and a spectral density matrix $Q_c$, we have

$$\langle w(\xi)w^T(\eta)\rangle = Q_c\,\delta(\xi - \eta), \quad (B.6)$$

where $\delta(\xi - \eta)$ is the Dirac delta function. Finally, using (B.6), and assuming a time-invariant system, we can rewrite (B.5) into

$$Q_{k-1} = \int_{0}^{\Delta t}\Phi(\xi)L\,Q_c\,L^T\Phi(\xi)^T\,d\xi. \quad (B.7)$$

The above equations describe the discretization process, which we will now apply to derive the random-walk and the nearly-constant-velocity models.

B.1 Random-walk dynamic model

The random-walk model assumes that changes in position arise purely due to a random factor. In other words, the system velocity is modelled by a white-noise sequence, i.e., $\dot{x}(t) = w(t)$. Thus the dynamics of a one-dimensional system governed by the random walk are described by the following continuous-time stochastic differential equation (s.d.e.)

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}\mathbf{x}(t) + \begin{bmatrix} 1 \\ 0 \end{bmatrix}w(t), \quad (B.8)$$

where the system state $\mathbf{x}(t)$ is composed of position and velocity, $\mathbf{x}(t) = [x(t), v(t)]^T$, and $w(t)$ is a one-dimensional continuous white-noise sequence with spectral density $q_c$. Following (B.1-B.7), the discretized s.d.e. is

$$\mathbf{x}_k = \Phi\mathbf{x}_{k-1} + \mathbf{W}_{k-1}, \qquad \Phi = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad (B.9)$$

where $\mathbf{W}_{k-1}$ is a white-noise sequence with covariance matrix (B.7)

$$Q = q_c\begin{bmatrix} \Delta t & 0 \\ 0 & 0 \end{bmatrix}. \quad (B.10)$$

Note that in (B.9) we write $\Phi$
instead of $\Phi(\Delta t)$ for brevity, and that we have dropped the subscript $(\cdot)_{k-1}$ in (B.10), since the covariance matrix $Q$ depends only on $\Delta t$, that is, the difference between the consecutive time-steps.

B.2 Nearly-constant-velocity dynamic model

The nearly-constant-velocity model assumes that while the changes in the position arise due to a nonzero velocity, the changes in the velocity arise purely due to a random factor. In other words, the system acceleration is modelled by a white-noise sequence, i.e., $\dot{v}(t) = w(t)$. Thus the dynamics of a one-dimensional system governed by the nearly-constant velocity are described by the following continuous-time stochastic differential equation

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\mathbf{x}(t) + \begin{bmatrix} 0 \\ 1 \end{bmatrix}w(t), \quad (B.11)$$

where the system state $\mathbf{x}(t)$ is composed of position and velocity, $\mathbf{x}(t) = [x(t), v(t)]^T$, and $w(t)$ is a one-dimensional continuous white-noise sequence with spectral density $q_c$. Following (B.1-B.7), the discretized s.d.e. is

$$\mathbf{x}_k = \Phi\mathbf{x}_{k-1} + \mathbf{W}_{k-1}, \qquad \Phi = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \quad (B.12)$$

where $\mathbf{W}_{k-1}$ is a white-noise sequence with covariance matrix (B.7)

$$Q = q_c\begin{bmatrix} \frac{1}{3}\Delta t^3 & \frac{1}{2}\Delta t^2 \\ \frac{1}{2}\Delta t^2 & \Delta t \end{bmatrix}. \quad (B.13)$$

It is interesting to see that the covariance matrix $Q$ implies correlation between the current perturbations $\mathbf{W}_{k-1}$ in position and velocity, even though $w(t)$ in (B.11) is a one-dimensional white-noise process acting only on the target velocity at the time instant $t$, and is thus not correlated with any other time instant. Note that the correlation in the discretized form (B.13) arises because we are integrating the effects of the white-noise process over a nonzero interval $\Delta t$ between consecutive time-steps.

Biography

Matej Kristan was born on October 30, 1978 in Ljubljana. He finished a four-year secondary school programme at gimnazija Ledina and enrolled in the Faculty of Electrical Engineering, University of Ljubljana, in 1997. Between 2000 and 2003 he worked as a student with the Laboratory of Modelling, Simulation and Control on a robot-soccer project.
The focus of his work was camera calibration and tracking. From 2001 to 2002 he was employed by Zamisel d.o.o., where he worked as a development engineer on several projects. The results of his work were methods for automatic online adjustment of camera parameters, methods for rapid merging of 3D triangulated objects, and automatic methods for calibration of laser CNC machines. In 2003 he received a diploma degree from the Faculty of Electrical Engineering, University of Ljubljana, in the field of automatics and cybernetics. His thesis was on a system for tracking in robot soccer. In 2003 he also started his postgraduate study at the same faculty and joined the Laboratory of Imaging Technologies, where he worked from 2003 to 2005 on a commercial project of tracking in sports. In 2005 he received a master's degree with a thesis on the application of sequential Monte Carlo methods to tracking in computer vision, and enrolled in the doctoral programme at the Faculty of Electrical Engineering, University of Ljubljana. In 2006 he was employed as a researcher at the Visual Cognitive Systems Laboratory, Faculty of Computer and Information Science, University of Ljubljana. In 2007 he was also employed as a researcher in the Machine Vision Group of prof. dr. Stanislav Kovačič at the Faculty of Electrical Engineering, University of Ljubljana. Between 2004 and 2008 he attended various international winter and summer schools on computer vision and statistical pattern recognition. Currently he is a researcher at the Visual Cognitive Systems Laboratory and the Machine Vision Group. Between 2003 and 2008 he authored or coauthored six scientific journal papers, twenty conference papers, a technical report, a book chapter, a diploma thesis and a master's thesis.
Personal data

First and last name: Matej Kristan
Date of birth: October 30, 1978
Birthplace: Ljubljana
Home address: Rožna Dolina c. VIII/29a, Ljubljana
Citizenship: Slovenian

Education

1997: Finished secondary school at gimnazija Ledina.
1997-2003: Study in a five-year university programme at the University of Ljubljana, Faculty of Electrical Engineering.
September 2003: Graduated at the Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Title of the thesis: Sledenje objektov v robotskem nogometu.
2003-2005: Postgraduate student, Faculty of Electrical Engineering, University of Ljubljana, Slovenia.
May 17-20, 2004: Advanced School on Computer Vision, Pattern Recognition and Image Processing, Verona, Italy. Invited speaker: Prof. Brendan J. Frey; course title: Bayesian Networks and Algorithms for Inference and Learning: Applications in Computer Vision, Audio Processing, and Molecular Biology.
September 2005: Received a master's degree, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Title of the thesis: Sekvenčne Monte Carlo metode za sledenje oseb v računalniškem vidu.
2005-2008: Doctoral study at the Faculty of Electrical Engineering, University of Ljubljana.
August 25-29, 2007: DIRAC/COSY Summer Workshop on Multi-Sensory Modalities in Cognitive Science, Gerzensee, Switzerland.
March 9-14, 2008: VISIONTRAIN Thematic School on Understanding Behavior from Video Sequences, Les Houches, France.

Professional and academic experience

2000-2003: Laboratory of Modelling, Simulation and Control, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: student. Work: spare-time work on a multi-agent management system in robot soccer; areas of focus: tracking and camera calibration.
2001-2002: Zamisel d.o.o., Ljubljana, Slovenia.
Position: development engineer. Work: development of methods for automatic calibration of laser CNC machines, methods for rapid merging of triangulated 3D objects, and automatic procedures for camera-parameter optimization for visual inspection.

2003-2005: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: development of a commercial application for motion analysis in team sports. Commissioned by: Inspireworks, NY, USA.

2005-2006: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: development of a laser system for measuring dimensions of facade plates. Commissioned by: Trimo d.o.o., Ljubljana.

2006-2008: Visual Cognitive Systems Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia. Position: researcher. Work: development of cognitive systems for a cognitive assistant (EU project CoSy).

2006-2007: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: M2-0156 project CIVaBIS, sponsored by the Ministry of Defence of the Republic of Slovenia.

2007-2008: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: technological project MIR, "Autonomous vessel for measurement and logistics" (APSIS), sponsored by the Ministry of Defence of the Republic of Slovenia.

2007-2008: Laboratory of Imaging Technologies, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. Position: researcher. Work: M3-0233 project PDR, sponsored by the Ministry of Defence of the Republic of Slovenia.

Honors and awards

2005: Received a best student paper award at the International Symposium on Image and Signal Processing and Analysis ISPA 2005.

Published work

1.01 Original Scientific Articles

[1] M. Kristan, M. Perše, S. Kovačič, and J.
Perš, Closed-world tracking of multiple interacting targets for indoor-sports applications, Computer Vision and Image Understanding (2008), in press.

[2] M. Perše, M. Kristan, G. Vučkovič, S. Kovačič, and J. Perš, A trajectory-based analysis of coordinated team activity in a basketball game, Computer Vision and Image Understanding (2008), in press.

[3] M. Kristan, M. Perše, S. Kovačič, and J. Perš, Sledenje več igralcev v športnih igrah na podlagi vizualne informacije, Electrotechnical Review 74 (2007), no. 1-2, 19-24.

[4] M. Kristan, J. Perš, M. Perše, and S. Kovačič, A Bayes-spectral-entropy-based measure of camera focus using a discrete cosine transform, Pattern Recognition Letters 27 (2006), no. 13, 1419-1580.

[5] G. Klančar, M. Kristan, and R. Karba, Wide-angle camera distortions and non-uniform illumination in mobile robot tracking, Robotics and Autonomous Systems 46 (2004), no. 2, 125-133.

[6] G. Klančar, M. Kristan, and S. Kovačič, Robust and efficient vision system for group of cooperating mobile robots with application to soccer robots, ISA Transactions 43 (2004), no. 3, 329-342.

1.08 Published Conference Papers

[7] V. Sulic, J. Perš, M. Kristan, A. Jarc, M. Perše, and S. Kovačič, Izvedba algoritma računalniškega vida na omrežni kameri, Računalniška obdelava slik in njena uporaba v Sloveniji (P. Božidar, ed.), 2008, pp. 35-40.

[8] M. Kristan, D. Skočaj, and A. Leonardis, Incremental learning with Gaussian mixture models, Computer Vision Winter Workshop, 2008, pp. 25-32.

[9] M. Perše, M. Kristan, J. Perš, G. Vučkovič, and S. Kovačič, Temporal segmentation of group motion using Gaussian mixture models, Computer Vision Winter Workshop, 2008, pp. 47-54.

[10] D. Skočaj, M. Kristan, and A. Leonardis, Continuous learning of simple visual concepts using Incremental Kernel Density Estimation, International Conference on Computer Vision Theory and Applications, 2008, pp. 598-604.

[11] M. Kristan, J. Perš, A. Leonardis, and S.
Kovačič, A hierarchical dynamic model for tracking in sports, Proceedings of the Sixteenth Electrotechnical and Computer Science Conference ERK07, September 2007, pp. 187-190.

[12] D. Skočaj, A. Vrečko, M. Kristan, B. Ridge, G. Berginc, and A. Leonardis, Interaktiven sistem za kontinuirano učenje vizualnih konceptov, Proceedings of the Sixteenth Electrotechnical and Computer Science Conference ERK07, 2007, pp. 167-170.

[13] M. Perše, M. Kristan, J. Perš, and S. Kovačič, Automatic evaluation of organized basketball activity, Computer Vision Winter Workshop 2007 (M. Grabner and H. Grabner, eds.), 2007, pp. 11-18.

[14] J. Perš, M. Kristan, M. Kolbe, and S. Kovačič, Tailgating detection using histograms of optical flow, Computer Vision Winter Workshop 2007 (M. Grabner and H. Grabner, eds.), 2007, pp. 19-26.

[15] M. Kristan, J. Perš, A. Leonardis, and S. Kovačič, Probabilistic tracking using optical flow to resolve color ambiguities, Computer Vision Winter Workshop 2007 (M. Grabner and H. Grabner, eds.), 2007, pp. 3-10.

[16] F. Erčulj, G. Vučkovič, J. Perš, M. Perše, and M. Kristan, Razlike v opravljeni poti in povprečni hitrosti gibanja med različnimi tipi košarkarjev, Zbornik naučnih i stručnih radova (N. Smajlovič, ed.), 2007, pp. 175-179.

[17] M. Kristan, J. Perš, M. Perše, and S. Kovačič, Towards fast and efficient methods for tracking players in sports, Proceedings of the ECCV Workshop on Computer Vision Based Analysis in Sport Environments (J. Perš and D. R. Magee, eds.), May 2006, pp. 14-25.

[18] M. Perše, M. Kristan, J. Perš, and S. Kovačič, A template-based multi-player action recognition of the basketball game, Proceedings of the ECCV Workshop on Computer Vision Based Analysis in Sport Environments (J. Perš and D. R. Magee, eds.), May 2006, pp. 71-82.

[19] J. Perš, M. Kristan, M. Perše, M. Bon, G. Vučkovič, and S. Kovačič, Analiza gibanja igralcev med tekmami, Računalniška obdelava slik in njena uporaba v Sloveniji (P.
Božidar, ed.), 2006, pp. 97-103.

[20] M. Kristan, J. Perš, M. Perše, M. Bon, and S. Kovačič, Multiple interacting targets tracking with application to team sports, International Symposium on Image and Signal Processing and Analysis, September 2005, pp. 322-327.

[21] M. Perše, J. Perš, M. Kristan, G. Vučkovič, and S. Kovačič, Physics-based modelling of human motion using Kalman filter and collision avoidance algorithm, International Symposium on Image and Signal Processing and Analysis, September 2005, pp. 328-333.

[22] M. Kristan, J. Perš, M. Perše, and S. Kovačič, Bayes spectral entropy-based measure of camera focus, Computer Vision Winter Workshop, February 2005, pp. 155-164.

[23] M. Kristan, J. Perš, M. Perše, and S. Kovačič, Implementacija Condensation algoritma v domeni zaprtega sveta, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 179-182.

[24] M. Kristan and F. Pernuš, Entropy based measure of camera focus, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 179-182.

[25] J. Perš, M. Kristan, M. Perše, and S. Kovačič, Observing human motion using far-infrared (FLIR) camera - some preliminary studies, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 187-190.

[26] M. Perše, J. Perš, M. Kristan, and S. Kovačič, Vrednotenje učinkovitosti kalmanovega filtra pri sledenju ljudi, Proceedings of the Thirteenth Electrotechnical and Computer Science Conference, vol. B, September 2004, pp. 191-194.

Technical Reports

[27] M. Kristan, D. Skočaj, and A. Leonardis, Approximating distributions through mixtures of Gaussians, LUVSS-TR-04/07, Faculty of Computer and Information Science, University of Ljubljana, September 2007.

Monographs and Other Completed Works

2.01 Scientific Monographs

[28] G. Klančar, M. Lepetič, M. Kristan, and R.
Karba, Mobile robots: New research, ch. Vision System Design for Mobile Robot Tracking, pp. 117-141, Nova Science Publishers, 2006.

2.09 Master's Thesis

[29] M. Kristan, Sekvenčne Monte Carlo metode za sledenje oseb v računalniškem vidu, Master's thesis, Faculty of Electrical Engineering, University of Ljubljana, September 2005.

2.11 Diploma Thesis

[30] M. Kristan, Sledenje objektov v robotskem nogometu, Diploma thesis, Faculty of Electrical Engineering, University of Ljubljana, September 2003.

Declaration

I declare that I have written this doctoral thesis independently under the supervision of my mentor, prof. dr. Stanislav Kovačič, and co-mentor, prof. dr. Aleš Leonardis. The assistance of other colleagues is fully acknowledged in the acknowledgements.

Ljubljana, May 16, 2008

mag. Matej Kristan, univ. dipl. inž. el.