Using Image Segmentation as a Basis for Categorization

Janez Brank
Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
janez.brank@ijs.si

Keywords: image categorization, segmentation, generalized kernels

Received: June 26, 2002

Image categorization is the problem of classifying images into one or more of several possible categories or classes, which are defined in advance. Classifiers can be trained using machine learning algorithms, but existing machine learning algorithms cannot work with images directly. This leads to the need for a suitable way of representing or describing images such that learning algorithms can work with them. We consider a representation based on texture segmentation and a similarity measure between segmented images which has been used successfully in the related area of image retrieval. A generalized kernel for use with the support vector machine (SVM) algorithm can be built from such a similarity measure. We compare this approach with a more straightforward representation based on autocorrelograms, and we show that these two representations can be combined to obtain classifiers with higher categorization accuracy.

1 Introduction

Besides textual and relational data, people increasingly have to deal with pictorial data, or data in the form of images. Large pictorial databases are being produced as archives digitize their collections, and the World Wide Web additionally contains a huge number of images. Apart from the purely technical problems of storing and processing such large amounts of data, the emergence of large collections of images raises the problem of enabling users to make sense of this data and find what they need.

Image categorization deals with one aspect of this problem: given a set of images and a set of predefined categories or classes, we assume that each image should belong to one or possibly several of these categories. For a large collection it would be impractical to have a human observer categorize all the images, so we want to be able to classify the images automatically after a small number of images has been classified manually to be used for training the automatic classifiers. However, this view of image categorization as a machine learning task immediately opens up a new problem: existing machine learning algorithms generally cannot work with images directly. Instead, they often assume they will be dealing with instances described by vectors or tuples. We need to be able to represent images using structures of this kind in order to make use of existing machine learning algorithms.

1.1 Related work in image retrieval

We can build on existing work in image retrieval, a related area where the problem of representation has already been encountered. In image retrieval, the user poses a query to the system and the system should find images that are somehow relevant to the query. Thus a way of representing the query, a way of representing images, and a way of comparing a query and an image (to determine whether the image is relevant to that query) are needed. One approach that is both technically feasible and useful enough to be commonly used in practice (e.g. in web image search engines such as Google) is to describe each image using a few keywords; the user's query can then request images whose description includes or excludes particular keywords. However, this approach is only feasible if textual descriptions of images can be obtained automatically (e.g.
from the HTML file that linked to an image); it is usually too costly to have a human maintainer prepare such descriptions manually for a larger database. In addition, this textual approach suffers from problems of synonymy and polysemy: different people may use different words to describe the same image, and the same words may mean different things to different people. Therefore it is often desirable to rely solely on what can be automatically extracted from the images themselves. The user's query is then often simply a request to look for images similar to a given query image or sketch (this approach is known as "querying by content", or "content-based image retrieval").

There are several close parallels between image retrieval and image categorization. In categorization, if a new image is similar to training images from a particular category, it should probably itself belong to that category; in content-based image retrieval, if an image from the database is similar to the query image, it should probably be shown to the user. Thus we see that both areas need a way of representing images and of assessing similarity between them. Many image representations and similarity measures have been proposed in image retrieval, and we would like to examine some of them from the point of view of image categorization as well.

One popular class of image representations is based on simplifying the image by approximating the color of each pixel by the nearest color from a predefined and fixed color palette; this can also be seen as partitioning (or quantizing) the space of all possible colors. Some information is then recorded about the presence of each color on the image. When simply the proportion of the image covered by (the pixels of) that color is stored, the resulting description is called a histogram [11]. However, this disregards all spatial information (how the color is distributed across the image): for example, a large patch of red would affect the histogram of an image in the same way as a large number of red pixels scattered all over the image, which is surely undesirable.

Several improved histogram-like representations of images have been proposed. For example, an autocorrelogram [4] records, for each color c and for a few small integers d, the probability that a pixel, chosen randomly at distance d from a randomly chosen pixel of color c, will itself be of the color c. This retains information about the amount of a color present on the image, but also records something about the spatial arrangement of each color.
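The following sketch illustrates how a description of this kind can be computed as a fixed-length feature vector. The palette size, the set of distances, and the restriction to horizontal and vertical neighbours are simplifying choices made for the example, not details prescribed by [4].

```python
import numpy as np

def autocorrelogram(rgb_image, levels=4, distances=(1, 3, 5, 7)):
    """Minimal autocorrelogram sketch: for each quantized colour c and each
    distance d, estimate the probability that a pixel lying d pixels away
    (horizontally or vertically, for simplicity) from a pixel of colour c
    also has colour c.  Returns a fixed-length feature vector."""
    # Quantize the colour space into a fixed palette of levels**3 colours.
    q = rgb_image.astype(np.int32) * levels // 256
    colour = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
    n_colours = levels ** 3

    features = np.zeros((n_colours, len(distances)))
    for j, d in enumerate(distances):
        same = np.zeros(n_colours)
        total = np.zeros(n_colours)
        # All pairs of pixels that are exactly d apart along one axis.
        for a, b in [(colour[:, :-d], colour[:, d:]),
                     (colour[:-d, :], colour[d:, :])]:
            for src, dst in [(a, b), (b, a)]:
                total += np.bincount(src.ravel(), minlength=n_colours)
                same += np.bincount(src[src == dst], minlength=n_colours)
        features[:, j] = same / np.maximum(total, 1)
    return features.ravel()
```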
Still, all "global" representations of this type can be seen as somewhat rigid, as they record a strictly fixed amount of data for each image. They cannot take into account the fact that some images are more complex than others, that an image may contain several objects, or that it may be helpful to distinguish between an (interesting) object and (uninteresting) background.

1.2 Image segmentation

Another, more sophisticated, class of image representations is based on segmentation, or dividing an image into a set of regions such that each region is roughly homogeneous in color and/or texture. Each image is then represented by a set of regions; each region is typically described by a short vector that is a by-product of the segmentation procedure (containing e.g. the average color of the region, information about texture, and so on). Additionally, the location of each region on the image (i.e. which parts of the image are covered by that region) is often recorded as well. In general, regions might overlap, and each region might itself be composed of several disjoint parts; this is not necessarily problematic, as the regions need not be shown to the user, and image similarity measures usually permit the regions to be disconnected and sometimes work with overlapping regions as well. Representations based on segmentation can adapt well to differences in complexity between images, and have been used successfully in image retrieval [NRS99, WLW00].

Various segmentation algorithms have been proposed in the context of image retrieval [NRS99, WLW00]. These approaches are usually based on dividing the image into a grid of small "windows" (e.g. 4x4 pixels); each window is described by a short vector (containing e.g. the average color and possibly a few coefficients from the higher-frequency bands of a wavelet transform, in order to capture the presence of edges or texture), and these vectors are then clustered. Each of the resulting clusters contains vectors that lie close together in their vector space, and such vectors hopefully correspond to windows that are similar in appearance; therefore it makes sense to form a region from such a group of windows. The region thus obtained can be described by the centroid of the cluster, i.e. by the average of the vectors describing the windows from which the region was formed.
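A minimal sketch of a procedure of this kind is given below. The 4x4 window size follows the text, while the choice of features (mean colour plus the within-window standard deviation in place of wavelet coefficients), the use of k-means, and the fixed number of clusters are simplifying assumptions of the example rather than details of the cited systems.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_image(rgb_image, window=4, n_regions=6):
    """Segmentation by clustering small windows: describe each window x window
    block by a short feature vector, cluster the vectors, and let each cluster
    become one region, described by the cluster centroid and by the fraction
    of the image it covers."""
    h, w, _ = rgb_image.shape
    h, w = h - h % window, w - w % window          # crop to whole windows
    img = rgb_image[:h, :w].astype(np.float64)

    # One feature vector per window: mean colour plus a crude "edge energy"
    # term (within-window standard deviation), standing in for the
    # higher-frequency wavelet coefficients used in the cited work.
    blocks = img.reshape(h // window, window, w // window, window, 3)
    mean_colour = blocks.mean(axis=(1, 3)).reshape(-1, 3)
    edge_energy = blocks.std(axis=(1, 3)).reshape(-1, 3)
    features = np.hstack([mean_colour, edge_energy])

    km = KMeans(n_clusters=n_regions, n_init=10).fit(features)
    regions = []
    for c in range(n_regions):
        size = np.mean(km.labels_ == c)            # fraction of windows in region c
        regions.append((km.cluster_centers_[c], size))
    return regions
```

Each image is thus reduced to a small set of (descriptor, size) pairs, which is the kind of representation that region-based similarity measures such as the one discussed next operate on.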
To use segmentation for image retrieval, it is also necessary to introduce a measure of similarity between segmented images. Such measures usually examine pairs of individual regions (one region from each image) and combine the measures of similarity or difference between regions into a single similarity measure between entire images. For example, the integrated region matching (IRM) measure [8] defines the distance between two images as a weighted sum of distances between regions, in which the weights are chosen so as to allow larger regions to have a larger influence on the similarity between images.

To use the representations described above for image categorization, one could use global representations (e.g. autocorrelograms) in combination with any of several machine learning algorithms (such as support vector machines, SVM), or use a segmentation-based similarity measure with an algorithm that allows an arbitrary similarity measure to be plugged into it (e.g. the nearest-neighbor method). However, our earlier work [1] has shown that the nearest-neighbor method, in combination with segmentation-based image similarity measures, results in rather unimpressive performance in comparison to SVM and global representations. It is therefore our goal to try using segmentation together with support vector machines. The main challenge here is that the SVM in its original formulation assumes all training and test examples to be described by vectors with the same number of components, while in the case of segmentation the description of each image has more structure than that, and the number of regions can also vary from image to image.

2 Support vector machines

Support Vector Machines (SVMs) [3] are a relatively recent family of machine learning algorithms that have been used successfully in many application domains. In the most elementary form of this method, we assume that each training example is a vector from some d-dimensional real space and that there are exactly two classes, called positive and negative. Several extensions to multiclass problems are possible [5], usually by converting one multiclass learning problem into several two-class problems (e.g. training one classifier for each pair of classes to separate members of one class from those of the other class).

In SVM learning, we want to separate the positive vectors from the negative ones using a hyperplane such that the positive training vectors lie on one side of the plane and the negative ones lie on the other side. Additionally, to help make the classifier more robust and more reliable on unseen test vectors, we want the training vectors to lie as far from the separating hyperplane as possible. Maximizing this distance (known as the margin) from the plane to the nearest training example can be cast as an optimization problem in the following way. Let x_i be the i-th training vector and y_i its label (which equals +1 for positive and -1 for negative training examples). A hyperplane can be described by the equation w^T x + b = 0, where w is the "normal", i.e. a vector perpendicular to the plane, and b is a threshold that determines the actual location of the plane in space; w^T x denotes the dot product of the vectors w and x. Given a particular vector x, we can determine which side of the plane it lies on by examining whether w^T x + b is positive or negative. However, to ensure that the training examples do not lie too close to the plane, we must also insist that w^T x + b has a large enough absolute value. We can describe this using the following conditions: y_i = +1 => w^T x_i + b >= 1 and y_i = -1 => w^T x_i + b <= -1, or, more concisely: y_i (w^T x_i + b) >= 1 for all training instances i. If all training examples satisfy these conditions, the space between the hyperplanes w^T x + b = 1 and w^T x + b = -1 is empty; to maximize the breadth of this margin, we need to maximize the distance between these two planes, which equals 2/||w||. Maximizing the margin is thus equivalent to minimizing ||w||^2 subject to the above conditions. This optimization problem is usually also extended to allow some training instances to be misclassified (or at least to lie within the margin, though perhaps on the correct side of the separating plane) if this leads to a wider margin on the other training instances (the soft-margin formulation of SVM). Solving the optimization problem gives us the values of w and b, and the resulting classifier simply works according to the formula prediction(x) = sgn[w^T x + b].

Using standard techniques from optimization theory, this optimization problem can be transformed into a "dual" form. It turns out that the dual form, as well as the resulting classification rule, can be expressed so that the training vectors need never be accessed directly, as long as we are able to compute the dot product of any two vectors. In particular, the normal w can be written as w = sum_i alpha_i y_i x_i, where the alpha_i coefficients are obtained by solving the dual optimization problem. The classifier can then be described as prediction(x) = sgn[b + sum_i alpha_i y_i x_i^T x].
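As an illustration of the resulting classifier, the decision rule sgn[b + sum_i alpha_i y_i x_i^T x] can be reconstructed directly from the dual coefficients found by the learner. The sketch below uses scikit-learn purely for convenience, and the data are synthetic placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class problem: each image is represented by a fixed-length vector
# (e.g. an autocorrelogram) and labelled +1 or -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 5)), rng.normal(+1, 1, (20, 5))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # soft-margin linear SVM

# prediction(x) = sgn(b + sum_i alpha_i y_i x_i^T x); only the support vectors
# have non-zero alpha_i.  scikit-learn stores alpha_i * y_i in dual_coef_
# and b in intercept_.
decision = X @ clf.support_vectors_.T @ clf.dual_coef_.ravel() + clf.intercept_
assert np.allclose(decision, clf.decision_function(X))
assert np.array_equal(np.where(decision > 0, 1, -1), clf.predict(X))
```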
Now suppose we used some mapping phi to map our original instances x_i into some other (possibly higher-dimensional) vector space F. Let K(x_i, x_j) := <phi(x_i), phi(x_j)>_F be a function that, given two instances x_i and x_j, computes the dot product <.,.>_F (in the new space F) of their images phi(x_i) and phi(x_j). Given two representations phi_1: X -> F_1 and phi_2: X -> F_2, combining their features (or attributes) would be equivalent to a new representation phi: X -> F_1 x F_2 defined by the formula phi(x) = (phi_1(x), phi_2(x)).
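A small numerical sketch of this observation: concatenating the two feature vectors makes the dot product in F_1 x F_2 equal to the sum of the dot products in F_1 and F_2, so the combined representation corresponds to simply adding the two dot-product functions. The representations and data below are placeholders for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Two hypothetical representations of the same images: phi1 (e.g. an
# autocorrelogram) and phi2 (e.g. features derived from segmentation).
rng = np.random.default_rng(1)
phi1 = rng.normal(size=(40, 8))    # phi_1(x) for 40 training images
phi2 = rng.normal(size=(40, 3))    # phi_2(x) for the same images
y = np.array([-1] * 20 + [+1] * 20)

# The dot product in F_1 x F_2 is the sum of the dot products in F_1 and F_2.
K1 = phi1 @ phi1.T
K2 = phi2 @ phi2.T
combined = np.hstack([phi1, phi2])
assert np.allclose(combined @ combined.T, K1 + K2)

# Such a dot-product matrix can be handed to the SVM directly, without ever
# forming the combined feature vectors explicitly.
clf = SVC(kernel="precomputed").fit(K1 + K2, y)
predictions = clf.predict(K1 + K2)   # rows: dot products with the training images
```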