Elektrotehniški vestnik 74(5): 297-302, 2007
Electrotechnical Review, Ljubljana, Slovenija

Unsupervised learning of scene and object planar parts

Katarina Mele, Jasna Maver
Univerza v Ljubljani, Fakulteta za računalništvo in informatiko, Tržaška 25, 1000 Ljubljana, Slovenija
E-mail: katarina.mele@fri.uni-lj.si, jasna.maver@ff.uni-lj.si

Abstract. In this work an adaptive method is presented for accurate and robust grouping of local features belonging to planes of interior scenes and to planar surfaces of objects. For an arbitrary set of images acquired from different views, the method organizes a huge number of local SIFT features so as to fill the gap between low-level vision (the front end) and high-level vision, i.e. domain-specific reasoning about geometric structures. The proposed method consists of three steps: exploration, selection, and merging with verification. The exploration step is a data-driven technique that proposes a set of hypothesis clusters. To select the final hypotheses, a matrix of preferences is introduced. It evaluates each hypothesis in terms of the number of features, the error of transformation, and feature duplications, and is applied in quadratic form in the maximization process. The merging step then combines information from multiple views to reduce redundancy and to enrich the selected representations. The proposed method is an example of unsupervised learning of planar parts of the scene and of objects with planar surfaces.

Key words: unsupervised learning, visual learning, local descriptors, SIFT descriptor, feature grouping

Unsupervised learning of a scene and of the planar objects in it

Povzetek (Slovenian abstract, translated). The method presented in this paper is intended for unsupervised learning of a scene, or more precisely of the planar parts of the objects that compose it. Learning is performed on an arbitrary set of images acquired from different viewing angles.
The proposed procedure is adapted for accurate and robust clustering of a huge number of local SIFT descriptors into groups that correspond to the individual planar parts of the scene. The geometric units obtained by organizing low-level features in this way bridge the gap between the low (perceptual) and the high (semantic) level of understanding visual information. The method consists of three steps: exploration of the space, hypothesis selection, and hypothesis merging. The first step, exploration of the visual information, is a data-driven procedure that builds a broad set of hypotheses. Hypothesis selection is performed with the quadratic form of a preference matrix, which evaluates the hypotheses with respect to the number of features and the transformation error, while penalizing feature duplication. In the last step we merge hypotheses that describe the same planar part but were computed from images acquired from different viewing angles. In this way we avoid duplicated hypotheses and at the same time enrich the representation of each part of the scene. Experimental results confirm the effectiveness of the method for unsupervised learning of planar parts of the scene and of objects.

Key words (in Slovenian): nenadzorovano učenje, vizualno učenje, lokalni deskriptorji, deskriptor SIFT, grupiranje značilnic.

Received 3 April 2005
Accepted 28 October 2005

1 Introduction

The use of local features is becoming increasingly popular for solving different vision tasks. Recently, the SIFT descriptor has been proposed for describing distinctive scale-invariant features in images [7]. SIFT features can be used to perform reliable matching between different images of an object or scene. Their invariance to image translation, scaling, and rotation makes them appropriate for stereo matching and tracking applications, and also suitable for mobile-robot localization. SIFT features are good natural visual landmarks, appropriate for tracking over a long period of time from different views; in [10], for example, the authors propose to use SIFT features for building 3D maps.
Local descriptors have previously been used for scene description [4]. In [11, 9] local descriptors are used to extract objects from video clips, but no 3D information about the objects is generated. On the other hand, in the work of [6], 3D geometrical information about object surfaces is built; the 3D geometrical representation is modeled from range images. In this work we present a method for accurate and robust grouping of local features belonging to planes of interior scenes, such as walls and the floor, and to planar surfaces of objects. In, for example, [2, 5, 8], features such as line segments and junctions are selected for plane description. A RANSAC algorithm is used to estimate the transformation between images [13]. Here we experiment with SIFT descriptors, as they uniquely describe a particular part of the scene. For an arbitrary set of images acquired from different views, the method organizes a huge number of local SIFT features to fill the gap between low-level vision (the front end), i.e. the outputs of various filtering operations, and high-level vision, i.e. domain-specific reasoning about geometric structures.

Figure 1. Illustration of feature extraction. Each circle corresponds to a region described by the SIFT descriptor.

The proposed method consists of three steps: exploration, selection, and merging with verification. The exploration step is a data-driven technique that proposes a set of hypothesis models, from which the selection step chooses the ones that explain the data in accordance with a matrix of preferences. Since the set of local features varies from view to view, the goal of the merging process is to combine the information from multiple views to reduce the redundancy and to enrich the selected representations. As demonstrated by experimental results, the proposed method is an example of unsupervised learning of planar parts of the scene and of objects with planar surfaces.
2 Step 1: Exploration

Given a set of descriptors of local patches of an interior scene, the goal is to group them into clusters in accordance with some geometric property or model. Here we examine planar surfaces. Let us assume that we have a set of images I = {I_1, I_2, ..., I_N} of a particular interior scene. The first step of our approach is the detection of DoG points and the computation of a SIFT descriptor for each local region [7] (Figure 1). Next, for each pair of images {(I_i, I_j) | i < j, i = 1, ..., N-1, j = 2, ..., N}, a set of matching features is determined. The matches are obtained on the basis of the Euclidean distance between SIFT descriptors. Each SIFT feature in image I_i is compared to all SIFT features in image I_j. A feature has a match if the Euclidean distance to the closest SIFT feature is at least four times shorter than the Euclidean distance to the next closest SIFT feature. Let S_ij denote the set of SIFT features of I_i having a match in I_j (Figure 2).

Figure 2. The best matches from S_ij.

Now, the task is to find in S_ij the features that belong to planar parts of the scene and to group them in accordance with the plane they belong to. For this purpose we apply a plane-to-plane homography [3]. The computation of the plane-to-plane homography requires at least four features in two images of the same plane. For a larger set of points the system is over-determined and the plane-to-plane homography is estimated by a homogeneous estimation method. A reliable solution requires starting the process of plane searching with a large set of small SIFT feature clusters, i.e. the initial hypotheses. The features of S_ij, here represented by their coordinates {f_t; f_t = (x_t, y_t), t = 1, 2, ..., |S_ij|}, are clustered by the k-means clustering algorithm. The algorithm is run several times, each time starting with a different, arbitrarily chosen initial set of cluster centers.
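The distance-ratio matching criterion described above can be sketched in Python/NumPy as follows. This is an illustrative sketch, not the authors' implementation; the function and variable names are our own, and the factor of four corresponds to a distance ratio of 0.25.

```python
import numpy as np

def ratio_test_matches(desc_i, desc_j, ratio=0.25):
    """Match each SIFT descriptor of image I_i against all descriptors of
    image I_j: a feature has a match only when the closest descriptor is
    at least four times closer than the second closest (ratio = 0.25)."""
    matches = []
    for a, d in enumerate(desc_i):
        dists = np.linalg.norm(desc_j - d, axis=1)  # Euclidean distances
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best <= ratio * second:
            matches.append((a, int(order[0])))  # index pair (in I_i, in I_j)
    return matches
```

The set S_ij then consists of the features of I_i appearing in the returned index pairs.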
The value k denotes the number of clusters obtained in one iteration and depends on the number of features |S_ij|. In the experiments k was set to k = max{round(|S_ij|/30), 3}. The obtained clusters of features define a set of initial hypotheses H_ij = {H_ij^1, H_ij^2, ..., H_ij^n}. For each hypothesis H_ij^l a plane-to-plane homography P_ij^l from I_i to I_j is computed by applying the RANSAC algorithm (Algorithm 1). If the algorithm fails to find a solution, the proportions of features denoted by D and K are decreased by a factor of 0.95 and RANSAC is run again. Next, the coordinates of all matching features of S_ij are transformed to image I_j in accordance with the transformation P_ij^l. The displacement errors d(f_t, f_t P_ij^l), t = 1, 2, ..., |S_ij|, are computed as Euclidean distances. All features with a displacement error below a pre-specified tolerance are included in the hypothesis (Figure 3). Note that features of the initial hypothesis can also be excluded from the hypothesis. Then a plane-to-plane homography is recomputed and new features are included in the hypothesis. The process stops when there are no features that can be added to the hypothesis.

Algorithm 1. Random Sample Consensus Algorithm. Assume: the parameters are estimated from D data items; there are T data items in total. (In our experiments D = 0.7 T.) The tolerance t corresponds to the maximal allowable displacement between the features of a matching pair when transformed to the same image plane and is set to 1 pixel.
1. Select D data items at random.
2. Estimate the parameters p.
3. Find how many of the T data items fit the model with parameters p within the tolerance t. Call this number K.
4. If K is big enough, exit with success. (In our experiments K = 0.8 T.)
5. Repeat steps 1 to 4 L times. (In our experiments L = 100.)
6. Fail if you get here.

3 Step 2: Selection

A redundant set of clusters results in many 'overlapping' hypotheses.
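Algorithm 1 above, together with the homogeneous (DLT) estimation of the plane-to-plane homography, can be sketched as follows. This is a minimal NumPy sketch under the parameter values stated in the text (D = 0.7 T, K = 0.8 T, t = 1 pixel, L = 100), not the authors' implementation; function names and the seeding are our own.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate the 3x3 homography H (up to scale) mapping src points to
    dst points by the homogeneous (DLT) estimation method."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The null vector of the stacked system is the last right-singular vector.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 3)

def project(H, pts):
    """Apply homography H to an (n, 2) array of image points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, L=100, t=1.0, d_frac=0.7, k_frac=0.8, seed=0):
    """Algorithm 1: draw D = d_frac*T matches, fit H, and succeed when at
    least K = k_frac*T matches reproject within tolerance t (pixels)."""
    rng = np.random.default_rng(seed)
    T = len(src)
    D, K = max(4, int(d_frac * T)), int(k_frac * T)
    for _ in range(L):
        idx = rng.choice(T, size=D, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < t
        if inliers.sum() >= K:
            return H, inliers
    # Failure: as in the text, the caller may shrink D and K by 0.95 and retry.
    return None, None
```

On a set of matches that truly lie on one plane, the first sample already yields a homography under which (nearly) all matches fall within the 1-pixel tolerance.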
To reduce the redundancy and to keep the hypotheses that efficiently group the data, a matrix of preferences Q is introduced. A hypothesis with a large number of features and a small error of transformation is preferred; this is encoded in the diagonal elements of Q. The off-diagonal terms encode the interaction between the hypotheses: duplication of features in different hypotheses is penalized. We consider only pairwise overlaps of the hypotheses. Selection of the hypotheses is performed by maximization of an objective function of quadratic form h Q h^T [12], where h is a binary vector of length n denoting a set of selected hypotheses: a value 1 at position i indicates the presence of the i-th hypothesis and 0 its absence. Q is an n x n symmetric matrix with elements

q_cc = K1 |Z_c| - K2 ε_c,c   and   q_cr = -K1 |Z_c ∩ Z_r| + K2 ε_c,r,  c ≠ r,

where |Z_c| is the number of features in the c-th hypothesis H_c, i.e. |Z_c| = sum(H_c), and ε_c,r, called the error of transformation, is defined as

ε_c,r = max( Σ_{f ∈ Z_c ∩ Z_r} d(f, f P_c)^2, Σ_{f ∈ Z_c ∩ Z_r} d(f, f P_r)^2 ).

The constants K1 and K2 are weights determined experimentally. (In our experiments K1 = 4 and K2 = 1.) To maximize the objective function h Q h^T we use tabu search [1]. The vector h that maximizes the objective function represents the final selection. Figure 4 depicts the hypotheses selected by the proposed approach. Note that each of them describes one plane.

3.1 Hypothesis rejection

Due to small differences in camera locations for some acquired image pairs (I_i, I_j), the computed plane-to-plane homography lacks sensitivity and therefore groups together SIFT features that do not lie on the same plane; see for example Figure 5. To reject such hypotheses, a rejection process is applied to give the final set of hypotheses. For each hypothesis H_k we find all image pairs that contain matches relevant to the hypothesis. The plane-to-plane homography is determined for each such image pair.
If for at least one image pair the plane-to-plane homography does not satisfy most of the matches, the hypothesis H_k is removed from further consideration.

4 Step 3: Merging

The selections on pairs of images {(I_i, I_j) | i < j, i = 1, ..., N-1, j = 2, ..., N} end up with a set of final hypotheses H = {H_1, ..., H_m}. Each hypothesis determines a cluster of SIFT features. A SIFT feature is represented as a structure of feature coordinates (x, y), a SIFT vector, and a weight which determines the importance of the feature. At the beginning all the weights are set to 1.

Figure 3. (a) Initial hypothesis. (b) The hypothesis is enlarged by adding all the features that satisfy the pre-specified tolerance of the plane-to-plane homography P.

In I there are images representing the same parts of the scene acquired from different locations and viewing directions. Hence, multiple hypotheses can determine the same parts of the scene. To reduce the redundancy and to enrich the final representation, we apply a merging process to H. SIFT descriptors are highly distinctive local parts of the scene, so even a small number of SIFT features uniquely determines a particular part of the scene. If in H_i and H_j there exists a subset of common matching features, the hypotheses are candidates for merging. It is still possible that H_i and H_j describe two different planar parts or different parts of a slightly bending surface. To filter out such cases, the features of both hypotheses are examined in the following way. First, we divide the features of H_i and H_j into three subsets: A = H_i ∩ H_j, B = H_i \ H_j, and C = H_j \ H_i. Next, we find all image pairs that contain matches from all three of these subsets; we require at least one match from each subset to do the merging. By applying a plane-to-plane homography to each such image pair we test whether the matching features from subsets A, B, and C lie on the same plane. If for all such image pairs the test is positive, we merge H_i and H_j.
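The subset test and the merge (including the weight summation described in the text) can be sketched as follows. This is a minimal sketch assuming hypotheses are represented as dicts mapping a feature identifier to its weight; `merge_hypotheses` and `same_plane` are hypothetical names, with `same_plane` standing in for the per-image-pair homography test.

```python
def merge_hypotheses(Hi, Hj, same_plane):
    """Try to merge two hypotheses. same_plane(A, B, C) stands in for the
    plane-to-plane homography test over all relevant image pairs; returns
    the merged hypothesis, or None when merging is not justified."""
    A = set(Hi) & set(Hj)   # common matching features
    B = set(Hi) - set(Hj)   # features seen only in Hi
    C = set(Hj) - set(Hi)   # features seen only in Hj
    # Merging requires common features and a positive coplanarity test.
    if not A or not same_plane(A, B, C):
        return None
    merged = dict(Hi)
    for f, w in Hj.items():
        # Weights of features present in both hypotheses are summed;
        # features unique to Hj keep their own weight.
        merged[f] = merged.get(f, 0) + w
    return merged
```

For example, merging {'a': 1, 'b': 1} with {'a': 1, 'c': 2} under a passing test yields {'a': 2, 'b': 1, 'c': 2}.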
The features of both hypotheses are transformed to the same image, the weights of the features shared by H_i and H_j are summed, and all the SIFT descriptors are kept. The process of merging is repeated (also on newly generated hypotheses) until there is no pair of hypotheses with a sufficient number of matching features. The weights of the features give us information about feature stability: features with high weights are more stable, while features with low weights are very likely outliers. The reader has to keep in mind that the merged hypotheses are still only hypotheses. By acquiring new images of the scene, new information is obtained and the rejection of a hypothesis is still possible.

5 Experiments

Results are presented for two experiments. In the first experiment the scene is fixed. In the second, the configuration of objects in the scene differs across the acquired set of images. In both experiments we deal with gray-level images of resolution 640 x 480. In the first experiment the feature clustering was generated from 15 images, leading to 86 final hypotheses. After the process of merging we end up with 8 different planes (Fig. 6). In the second experiment 10 different images were acquired. The process ends up with 54 hypotheses (Fig. 7). Some hypotheses of feature clusters of the same plane were not merged, due to the sparse nature of SIFT features and the insufficient number of acquired images.

Figure 4. Final set of hypotheses. Each of the selected hypotheses represents one plane: (a) Hypothesis 1 (front side of the printer), (b) Hypothesis 2 (wall newspaper), (c) Hypothesis 3 (coat hanger).

Figure 5. Rejected hypothesis.

The hypotheses are built from different images showing the same planar part from angles at which some parts are unseen or occluded by objects. Consequently, the results can show the locations of features belonging to one cluster even though the planar part is unseen or occluded by other objects.
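The stability reading of the accumulated weights described above can be turned into a simple filter. A minimal sketch; the function name and the threshold `min_weight` are our assumptions, not values from the paper.

```python
def stable_features(hypothesis, min_weight=2):
    """Keep features whose accumulated weight indicates they were observed
    consistently across merged hypotheses; low-weight features are treated
    as likely outliers. The threshold min_weight is an assumption."""
    return {f: w for f, w in hypothesis.items() if w >= min_weight}
```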
6 Conclusion

In this work we present a method for clustering SIFT features belonging to planar surfaces. The clusters obtained through the phases of exploration, selection and merging can be used as initial structures for building higher-level scene representations. The proposed method performs unsupervised learning of objects with planar parts, as demonstrated by the second experiment. The weights attached to the SIFT descriptors can also be exploited to detect changes in the interior scene, e.g. changes on the wall newspaper or the coat hanger, and together with a time parameter they would allow for continuous long-term learning.

7 References

[1] A. Hertz, E. Taillard, D. de Werra. A tutorial on tabu search. In Proc. of AIRO'95, pages 13-24, Italy, 1995.
[2] C. Baillard and A. Zisserman. Automatic Reconstruction of Piecewise Planar Models from Multiple Views. In Proc. CVPR, pages 2559-2565, 1999.
[3] A. Criminisi, I. Reid, and A. Zisserman. A plane measuring device. In Proc. BMVC, September 1997.
[4] G. Dorko and C. Schmid. Selection of scale-invariant neighborhoods for object class recognition. In Proceedings of the 9th ICCV, Nice, France, pages 634-640, 2003.
[5] Z. Kim and R. Nevatia. Automatic description of complex buildings from multiple images. Computer Vision and Image Understanding, pages 1077-3142, 2004.
[6] A. Leonardis. Gradnja in modeliranje parametričnih struktur v slikah. Electrotechnical Review, 3/4(58): 133-141, 1991.
[7] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2): 91-110, 2004.
[8] I. K. Park, K. M. Lee, and S. U. Lee. Perceptual grouping of line features in 3-D space: a model-based framework. Pattern Recognition, pages 145-159, 2004.
[9] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. Segmenting, Modeling, and Matching Video Clips Containing Multiple Moving Objects.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3): 477-491, March 2007.

Figure 6. Experiment with a fixed scene. Eight clusters of SIFT features belonging to eight different planar parts of the scene are found.

Figure 7. The scene is altered from view to view. Eleven different clusters are found, belonging to five different planar parts of the scene. Only six clusters are displayed.

[10] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping using scale-invariant features. In Proceedings of the IEEE ICRA, pages 2051-2058, Seoul, Korea, May 2001.
[11] J. Sivic, F. Schaffalitzky, and A. Zisserman. Object Level Grouping for Video Shots. International Journal of Computer Vision, 67(2): 189-210, 2006.
[12] M. Stricker and A. Leonardis. ExSel++: A general framework to extract parametric models. In Computer Analysis of Images and Patterns, pages 90-97, 1995.
[13] J. Wills, S. Agarwal, and S. Belongie. What Went Where. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 437-454, Madison, WI, June 2003.