Advanced correlation filters for facial landmark localization

Vitomir Štruc1, Jerneja Žganec Gros2, Nikola Pavešić1
1 Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia
2 Alpineon Ltd., Ulica Iga Grudna 15, SI-1000 Ljubljana, Slovenia
E-mail: vitomir.struc@fe.uni-lj.si, jerneja.gros@alpineon.si, nikola.pavesic@fe.uni-lj.si

Abstract
The paper develops a novel technique for facial landmark localization based on advanced correlation filters. Specifically, it introduces a new class of advanced correlation filters, named Principal Directions of Synthetic Exact Filters or PSEFs for short, and applies them to the problem of eye localization. To improve upon the basic performance of the PSEF filters for eye localization, two types of constraints (soft and hard) that affect the outcome of the localization procedure are also proposed and incorporated into the procedure. The effectiveness of the developed localization technique is demonstrated on more than 40000 facial images pooled from the FERET and LFW databases. The results of our experiments suggest that the PSEF filters produce significantly better localization results than, for example, the Haar-cascade object detector, while ensuring a more than 10-fold improvement in processing time.

1 Introduction
In recent years we have witnessed an increased interest in so-called advanced correlation filters, which have proven extremely successful in solving complex tasks related to pattern recognition in computer vision, e.g., face or palm-print recognition, object detection, tracking, etc. The interest in these types of filters is fueled not only by their effectiveness, but also by properties such as mathematical simplicity, computational efficiency and robustness to distortions [1].
In general, advanced correlation filters bear a resemblance to templates and correlation-based template matching techniques, where patterns of interest in images are searched for by cross-correlating the input image with one or more example templates and examining the resulting correlation plane for large values - also known as correlation peaks. With properly designed templates, these correlation peaks can be exploited to determine the presence and/or location of patterns of interest in the given input image [1]. Early template matching techniques relied on rather primitive templates, computed, for example, through simple averaging of the available training images. Contemporary methods, on the other hand, use correlation templates (also referred to as advanced correlation filters) that are constructed by optimizing specific performance criteria [1], [2]. Examples of existing correlation filters can be found in [3], [4], [5] or [6].

In this paper we focus on a class of correlation filters called Principal directions of Synthetic Exact Filters (PSEFs) that we have originally introduced in [2]. These filters generalize upon the recently proposed class of advanced correlation filters called Average of Synthetic Exact Filters (ASEF) [6]. Based on these filters and a number of localization constraints we develop a facial landmark localization procedure and demonstrate its effectiveness in comparison with ASEF filters and the established Haar cascade classifier proposed in [7].

2 Preliminaries
ASEF filters represent a recently proposed class of advanced correlation filters that have already proven successful in various computer vision problems [6]. Similar to other correlation filters, a pattern of interest in an image is detected with an ASEF filter by cross-correlating the input image with the given ASEF filter and examining the resulting correlation plane for possible correlation peaks.
However, ASEF filters differ from most existing correlation filters in the way they are constructed. Unlike the majority of correlation filters, which define only a single correlation value per training image, ASEF filters predefine the entire correlation plane for each available training image. As stated by Bolme et al. [6], this correlation plane commonly features a high peak centered at the pattern of interest and (near) zeros at all other image locations (second image in Fig. 1) [2]. Such a synthetic correlation output results in a synthetic exact filter (SEF) (third image in Fig. 1) that can be used to locate the pattern of interest in its corresponding training image. Obviously, SEF filters do not exhibit broad generalization capabilities; instead, they produce distinct peaks only for those images that were used for their construction. To overcome this shortcoming, Bolme et al. [6] computed a new filter by averaging all of the synthetic exact filters corresponding to a specific pattern of interest. Through the averaging operation the authors ensured better generalization capabilities of the ASEF filters when compared to the SEFs and avoided the over-fitting problem that affects many existing correlation filters.

Figure 1: Construction of a synthetic exact filter (SEF): normalized input image multiplied with a cosine window (left), the synthetic correlation output plane (middle), the synthetic exact filter corresponding to the training image on the left (right).

Consider a set of $n$ training images $x_1, x_2, \ldots, x_n$ and $n$ corresponding image locations of the pattern of interest.
The first step towards computing the ASEF filter for a pattern of interest is the construction of the desired correlation outputs $y_1, y_2, \ldots, y_n$ for all $n$ training images:

$$y_i(x, y) = e^{-\frac{(x - x_i)^2 + (y - y_i)^2}{\sigma^2}}, \quad \text{for } i = 1, 2, \ldots, n, \qquad (1)$$

where $\sigma$ denotes the standard deviation of the Gaussian-shaped correlation output and $(x_i, y_i)$ represents the coordinate pair corresponding to the location of the pattern of interest in the $i$-th training image. Once the correlation outputs have been determined, SEFs are calculated for all $n$ pairs of training images and correlation outputs as follows:

$$H_i^* = \frac{Y_i \odot X_i^*}{X_i \odot X_i^* + \epsilon}, \quad \text{for } i = 1, 2, \ldots, n, \qquad (2)$$

where $X_i = \mathcal{F}(x_i)$ and $Y_i = \mathcal{F}(y_i)$ denote the Fourier transforms of the $i$-th training image and its corresponding synthetic correlation output, $H_i = \mathcal{F}(h_i)$ stands for the Fourier transform of the $i$-th SEF filter $h_i$, $\epsilon$ denotes a small constant that prevents divisions by zero, $\odot$ stands for the Schur product (the division is likewise element-wise) and $*$ for the conjugate operator.

In the final step, all $n$ SEFs are simply averaged to produce an ASEF filter (see second image of Fig. 2 for a visual example) that can be used to locate the pattern of interest in a given input image. Here, the ASEF filter in the frequency domain is defined as [6]:

$$H^* = \frac{1}{n} \sum_{i=1}^{n} H_i^*. \qquad (3)$$

To apply the ASEF filter for localization of a pattern of interest in an input image, the input image is first cross-correlated with the appropriate ASEF filter and the correlation output is then examined for its maximum. The location of the maximum is simply declared the location of the pattern of interest. In the frequency domain this can be defined as follows:

$$Y = X_t \odot H^*. \qquad (4)$$

Figure 2: Visualization of the facial landmark localization procedure (from left to right): the input image, the ASEF filter (with shifted quadrants), the correlation output, the input image with the detected correlation maximum.
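The ASEF construction and localization steps of Eqs. (1) to (4) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the toy 32 x 32 images with a random 7 x 7 "landmark" patch, and the values of `sigma` and `eps` are all hypothetical, and the log transform, normalization and cosine window applied in the experiments are omitted.

```python
import numpy as np

def gaussian_target(shape, center, sigma=2.0):
    """Synthetic correlation output of Eq. (1): Gaussian peak at center = (row, col)."""
    rows, cols = np.indices(shape)
    r0, c0 = center
    return np.exp(-((rows - r0) ** 2 + (cols - c0) ** 2) / sigma ** 2)

def train_asef(images, centers, sigma=2.0, eps=1e-3):
    """Average the exact filters H_i^* of Eq. (2) to obtain H^* of Eq. (3)."""
    H_conj = np.zeros(images[0].shape, dtype=complex)
    for img, center in zip(images, centers):
        X = np.fft.fft2(img)
        Y = np.fft.fft2(gaussian_target(img.shape, center, sigma))
        H_conj += (Y * np.conj(X)) / (X * np.conj(X) + eps)  # element-wise
    return H_conj / len(images)

def localize(image, H_conj):
    """Eq. (4): filter in the frequency domain and return the peak as (row, col)."""
    y = np.real(np.fft.ifft2(np.fft.fft2(image) * H_conj))
    return tuple(int(v) for v in np.unravel_index(np.argmax(y), y.shape))

# Toy data (assumption): one fixed texture patch pasted onto weak noise.
rng = np.random.default_rng(0)
patch = rng.random((7, 7))

def make_image(center, noise=0.01, size=32):
    img = noise * rng.standard_normal((size, size))
    r, c = center
    img[r - 3:r + 4, c - 3:c + 4] += patch
    return img

centers = [tuple(rng.integers(8, 24, size=2)) for _ in range(20)]
images = [make_image(c) for c in centers]
H_conj = train_asef(images, centers)

true_center = (12, 19)
found = localize(make_image(true_center), H_conj)
print(found, true_center)
```

On such idealized data the detected peak should coincide with the true patch centre, or lie within a pixel of it; real face images are far less forgiving, which is what motivates the averaging over many training images.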
3 PSEF filters
The filter construction procedure described in the previous section ensures high generalization capabilities of the ASEF filters through an averaging procedure applied on the SEF filters. However, it implicitly presumes that the SEF filters represent a random variable drawn from a unimodal symmetric distribution and, thus, that their distribution is adequately described by their sample mean. To derive our PSEF filters we make a similar assumption and assume that the SEF filters are drawn from a multivariate Gaussian distribution. Under this assumption, we are able to extend the concept of ASEF filters to a more general form.

The basic reasoning for our generalization stems from the fact that the first eigenvector of the correlation matrix of some sample data corresponds to the data's mean (or average), while the remaining eigenvectors encode the variance of the sample data in directions orthogonal to the data's average. By using more than only the first eigenvector of the SEF correlation matrix for the localization procedure, we should be able to further improve upon the localization performance of the original ASEF filters [2].

Again consider a set of $n$ training images $x_1, \ldots, x_n$, for which we have already computed $n$ corresponding SEFs $h_1, h_2, \ldots, h_n$ for some pattern of interest (where $h_i = \mathcal{F}^{-1}(H_i)$ stands for the $i$-th SEF filter defined in the spatial domain). Assume also that the SEFs reside in a $d$-dimensional space and that they are arranged into a column-data matrix $Z \in \mathbb{R}^{d \times n}$. Instead of simply averaging the SEFs to produce an ASEF filter, we compute the sample correlation matrix $\Sigma$ of the SEFs, $\Sigma = ZZ^T \in \mathbb{R}^{d \times d}$, and adopt its leading eigenvectors as our PSEF filters, i.e.:

$$\Sigma f_j = \lambda_j f_j, \quad \text{where } j = 1, 2, \ldots, \min(d, n), \qquad (5)$$

and $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_j \geq \cdots \geq \lambda_{\min(d,n)}$.
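As an illustration of Eq. (5), the leading eigenvectors of $\Sigma = ZZ^T$ can be obtained without forming the $d \times d$ matrix explicitly, because they equal the left singular vectors of $Z$ and the eigenvalues equal the squared singular values. The sketch below uses this SVD shortcut, which is a substitution made here for convenience and not necessarily how the filters were computed in the paper; the function name and the random toy "SEFs" are assumptions.

```python
import numpy as np

def psef_filters(sefs, k):
    """Return the k leading eigenvectors and eigenvalues of Sigma = Z Z^T (Eq. 5)."""
    shape = sefs[0].shape
    Z = np.stack([h.ravel() for h in sefs], axis=1)   # column-data matrix, d x n
    U, s, _ = np.linalg.svd(Z, full_matrices=False)   # singular values are sorted
    filters = [U[:, j].reshape(shape) for j in range(k)]
    return filters, s[:k] ** 2

# Toy SEFs (assumption): a common component plus small random variations.
rng = np.random.default_rng(0)
mean_sef = rng.standard_normal((8, 8))
sefs = [mean_sef + 0.1 * rng.standard_normal((8, 8)) for _ in range(5)]
filters, eigvals = psef_filters(sefs, k=3)

# Check the eigen-equation against the explicitly formed correlation matrix.
Z = np.stack([h.ravel() for h in sefs], axis=1)
Sigma = Z @ Z.T
f1 = filters[0].ravel()
print(np.allclose(Sigma @ f1, eigvals[0] * f1))

# Compare the first eigenvector with the plain average (the ASEF-like filter).
average = Z.mean(axis=1)
cos = abs(f1 @ average) / np.linalg.norm(average)
print(round(float(cos), 4))
```

The absolute value in the cosine similarity is deliberate: an eigenvector is only defined up to its sign, so the first PSEF may point along the average or against it.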
In Eq. (4), $Y$ denotes the correlation output in the frequency domain, $X_t = \mathcal{F}(x_t)$ denotes the Fourier transform of a test image $x_t$, $H$ stands for the ASEF filter in the frequency domain and $\odot$ again represents the Schur product. The procedure is also illustrated in Fig. 2.

One problem arising as a consequence of the PSEF construction procedure is the sign ambiguity of the PSEF filters $f_j$. Since the computed filters can be multiplied by -1 and still represent valid eigenvectors of $\Sigma$, we have to resolve this sign ambiguity. In the experimental section we resolve it through preliminary experiments on some training data.

3.1 Utilizing linearity
The landmark localization procedure using PSEF filters is identical to the one presented in Section 2, except for the fact that we have more than a single filter at our disposal and, hence, obtain more than one correlation output:

$$Y_j = X_t \odot F_j^*, \quad \text{for } j \in \{1, 2, \ldots, \min(d, n)\}, \qquad (6)$$

where $X_t = \mathcal{F}(x_t)$ denotes the Fourier transform of a given test image $x_t$, $F_j$ denotes the Fourier transform of the $j$-th PSEF filter $f_j$, and $Y_j$ refers to the $j$-th correlation output in the Fourier domain.

Figure 3: Comparison of the visual appearance of an ASEF filter (left) and the combined PSEF filter (right).

To determine the location of our pattern of interest in the given input image, we need to examine all correlation outputs $Y_j$ for maxima and combine all obtained information. A straightforward way of doing this is to examine only the linear combination of all correlation outputs for its maximum and use the location of the detected maximum as the location of our pattern of interest.
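A weighted sum of the individual correlation outputs of Eq. (6) can be computed in two equivalent ways: by filtering once per PSEF filter and summing the outputs, or by first summing the weighted filters and filtering once. The minimal sketch below checks this equivalence numerically. The random "filters", the test image and the weight values are placeholders chosen for illustration only.

```python
import numpy as np

def correlate(image, filt):
    """Circular cross-correlation via the FFT, i.e. Y = X ⊙ F* followed by an inverse FFT."""
    Y = np.fft.fft2(image) * np.conj(np.fft.fft2(filt))
    return np.real(np.fft.ifft2(Y))

rng = np.random.default_rng(1)
x_t = rng.standard_normal((16, 16))                      # stand-in test image
filters = [rng.standard_normal((16, 16)) for _ in range(3)]
weights = np.array([0.6, 0.3, 0.1])                      # assumed weights, summing to one

# Option 1: one filtering operation per filter, then combine the outputs.
y_sum = sum(w * correlate(x_t, f) for w, f in zip(weights, filters))

# Option 2: combine the filters first, then filter only once.
f_c = sum(w * f for w, f in zip(weights, filters))
y_c = correlate(x_t, f_c)

print(np.allclose(y_sum, y_c))  # True, up to floating-point rounding
```

The second option needs a single pair of FFTs per test image regardless of how many filters are combined, which is why the cost does not grow with the number of filters.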
Thus, we have to examine the following combined correlation output:

$$y_c = \sum_{i=1}^{k} w_i y_i,$$

where $y_i$ denotes the correlation output (in the spatial domain) of the $i$-th PSEF filter, $w_i$ denotes the weighting coefficient of the $i$-th correlation output, $y_c$ denotes the combined correlation output, and $k$ stands for the number of PSEF filters used ($1 \leq k \leq \min(d, n)$). For $k > 1$ we add additional information to the combined correlation output by including additional PSEF filters into the localization procedure.

The presented procedure requires one filtering operation for each PSEF filter used. However, the computation can be sped up by exploiting the linearity of the combination procedure. Instead of combining the correlation outputs, we simply combine all employed PSEF filters into one single filter with enhanced localization capabilities, i.e.:

$$y_c = \sum_{i=1}^{k} w_i (f_i \otimes x_t) = \Big( \sum_{i=1}^{k} w_i f_i \Big) \otimes x_t = f_c \otimes x_t, \qquad (7)$$

where $f_c = \sum_{i=1}^{k} w_i f_i$ and $\sum_{i=1}^{k} w_i = 1$. In the presented equations $f_c$ stands for the combined PSEF filter and $\otimes$ denotes the convolution operator. Note that the localization procedure with the combined PSEF filter has exactly the same computational complexity as the procedure relying on ASEF filters, regardless of the number of PSEF filters used. For our experiments the weights of the individual PSEF filters were selected as: $w_i = $ […]. An example of the visual appearance of the combined PSEF filter obtained with the presented weighting procedure (after the sign ambiguity has been eliminated - see Section 4) is shown on the right hand side of Fig. 3.

3.2 Incorporating localization constraints
Figure 4: Illustration of the soft constraint concept.

To improve upon the basic performance of the PSEF filters we incorporate two constraints into the facial landmark localization procedure. The first, which we will refer to as our soft constraint in the remainder, represents a weighting function that is multiplied with the correlation output to give more weight
to more probable landmark locations. The weighting function can be regarded as a kind of prior model and is estimated by analyzing the locations of the landmark of interest on some training data and fitting a Gaussian distribution (with a diagonal covariance matrix) to these locations. The procedure is illustrated in Fig. 4. Here the first image depicts the average of our training set of 15520 face images after the face detection step with superimposed coordinates of the left eye from all images in the training set. The second image shows the estimated weighting function and the third image presents isohypses of the estimated Gaussian weighting function superimposed over the average face.

The second constraint incorporated into the landmark localization procedure, referred to as our hard constraint in the remainder, limits the search space for the facial landmark of interest. When using this heuristic, we look for the left eye only in the upper left quadrant of the image and, similarly, we search for the right eye only in the upper right quadrant of the image.

4 Experiments and results
To assess the landmark localization procedure relying on PSEF filters we make use of two face databases, namely, the FERET database [8] and the Labeled Faces in the Wild (LFW) database [9]. We extract the facial regions from all images of the two databases using the Haar cascade classifier proposed by Viola and Jones [7]. After detecting the facial regions in all images, we select 640 images from the LFW database and manually label the locations of the left and right eye. Next, we produce 40 variations of the facial region of each of the 640 LFW images by randomly shifting the location of the facial regions by up to ±5 pixels, rotating them by up to ±15°, scaling them by a factor of 1.0 ± 0.15 and mirroring them about the y axis.
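One way to picture this augmentation step is as a random similarity transform about the image centre that is applied to the facial region and, with the same matrix, to the manually labelled eye coordinates. The sketch below only samples such a transform and maps a point through it. The uniform sampling, the 50% mirroring probability, the order of the operations and all names are assumptions made for illustration; the source does not specify them, and the image warping itself is left out.

```python
import numpy as np

def random_transform(rng, size=128, max_shift=5.0, max_angle=15.0, max_scale=0.15):
    """Sample one 3x3 homogeneous transform acting about the image centre."""
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)
    angle = np.deg2rad(rng.uniform(-max_angle, max_angle))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    mirror = -1.0 if rng.random() < 0.5 else 1.0
    c = (size - 1) / 2.0
    to_origin = np.array([[1, 0, -c], [0, 1, -c], [0, 0, 1]], dtype=float)
    linear = scale * np.array([[np.cos(angle), -np.sin(angle)],
                               [np.sin(angle), np.cos(angle)]])
    # Negate x before rotating, i.e. mirror about the vertical axis through the centre.
    linear = linear @ np.diag([mirror, 1.0])
    A = np.eye(3)
    A[:2, :2] = linear
    back = np.array([[1, 0, c + dx], [0, 1, c + dy], [0, 0, 1]], dtype=float)
    return back @ A @ to_origin

def transform_point(T, point):
    """Map an (x, y) label, e.g. an eye centre, through the transform T."""
    x, y = point
    out = T @ np.array([x, y, 1.0])
    return float(out[0]), float(out[1])

rng = np.random.default_rng(3)
left_eye = (40.0, 50.0)                       # hypothetical label in a 128 x 128 crop
variants = [transform_point(random_transform(rng), left_eye) for _ in range(40)]
print(len(variants), variants[0])
```

Producing 40 such variants for each of the 640 labelled images gives 640 x 40 = 25600 training samples.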
Through these transformations, we augment the initial set of 640 images to a set of 25600 images (of size 128 x 128 pixels) and employ them for training of the ASEF and PSEF filters. For testing purposes we apply the same random transforms to 3815 images from the FERET database. Here, we produce 12 modifications of each facial region, which results in 45780 facial images that can be used for our assessment. Prior to subjecting the face images to the proposed localization procedure, all face images are subjected to a log transform and normalized to zero mean and unit variance. In the last step the images are weighted with a cosine window to reduce the frequency effects of the image edges encountered when applying the Fourier transform [6].

To measure the effectiveness of the localization procedure we adopt the following criterion [10]:

$$\eta = \frac{\max\left( \| l_{le} - r_{le} \|, \| l_{re} - r_{re} \| \right)}{\| r_{le} - r_{re} \|},$$

where $l_{le}$ and $l_{re}$ denote the locations of the left and right eye found by the assessed procedure, $r_{le}$ and $r_{re}$ denote the reference locations of the left and right eye, respectively, and the expression $\| r_{le} - r_{re} \|$ represents the reference interocular (L2) distance. For our assessment we examine the correct localization rate for different operating points, i.e., $\eta < \Delta \in \{0.10, 0.15, 0.20, 0.25\}$. We use the soft constraint in all of our experiments with correlation filters, and state explicitly when we also adopt the hard constraint.

Figure 5: Results of preliminary experiments aimed at alleviating the sign ambiguity of the computed PSEFs. Panels: (a) PSEF 1, (b) PSEF 2, (c) PSEF 3, (d) PSEF 4, (e) PSEF 5.

The goal of our first series of experiments is to alleviate the sign ambiguity of the computed PSEF filters. To this end, we compute 5 PSEF filters (corresponding to the 5 largest, non-zero eigenvalues of Eq. 5), derive two filters from each of the 5 PSEF filters by multiplying them with +1 and -1, and normalize the results to zero mean and unit variance.
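The localization criterion and the sign-selection idea can be sketched as follows. Each signed variant of a filter is run on the evaluation images, the error $\eta$ is computed per image, and the sign with the higher localization rate at the chosen threshold is kept. The function names, the tie-breaking rule in favour of +1, and the example error lists are assumptions for illustration and do not reproduce any measured values.

```python
import math

def normalized_error(found_left, found_right, ref_left, ref_right):
    """Largest eye displacement divided by the reference interocular distance."""
    d_left = math.dist(found_left, ref_left)
    d_right = math.dist(found_right, ref_right)
    return max(d_left, d_right) / math.dist(ref_left, ref_right)

def localization_rate(errors, delta):
    """Fraction of images whose error eta lies below the threshold delta."""
    return sum(e < delta for e in errors) / len(errors)

def choose_sign(errors_plus, errors_minus, delta=0.25):
    """Keep the sign (+1 or -1) whose filter localizes more images at delta."""
    rate_plus = localization_rate(errors_plus, delta)
    rate_minus = localization_rate(errors_minus, delta)
    return 1 if rate_plus >= rate_minus else -1

# Worked example: displacements of 5 and 6 pixels, interocular distance of 60 pixels.
eta = normalized_error((33, 44), (90, 46), (30, 40), (90, 40))
print(eta)

# Hypothetical per-image errors for the +1 and -1 variant of one filter.
errors_plus = [0.05, 0.12, 0.30]
errors_minus = [0.40, 0.22, 0.31]
print(choose_sign(errors_plus, errors_minus))
```

In the worked example the larger of the two displacements (6 pixels) determines the error, giving 6 / 60 = 0.1. Because the criterion uses a strict inequality, such an image would not count as correctly localized at a threshold of 0.10 but would at 0.15.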
With the 5 computed filter pairs, we conduct localization experiments with the 45780 face images of the FERET database and plot the results in the form of graphs as shown in Fig. 5. We select a threshold of $\Delta = 0.25$ as the relevant operating point of our localization procedure and based on this value determine the appropriate sign of each of the five PSEF filters. Note here that more (or fewer) than 5 filters could be used for our experiments; the presented results, however, are enough to show the feasibility of our approach. If we take a look at the results presented in Fig. 5, we can see that in our case the best localization results are obtained with the first two filters being multiplied with +1 and the remaining filters being multiplied with -1. Furthermore, we can notice that the best localization performance is obtained with the first PSEF filter, which in fact corresponds to an ASEF filter, while the remaining filters perform worse.

Our second series of experiments comprises two types of tests. The first type does not rely on the hard constraint while the second type does. The results for the first type of experiments are shown on the left side of Fig. 6, while the results of the second type of experiments are shown on the right side of Fig. 6. Some numerical results for different values of $\Delta$ are also summarized in Table 1. Note that the proposed PSEF filters outperform both tested alternatives for eye localization, namely, ASEF filters as well as the Haar cascade classifier.

Figure 6: Comparison of different localization techniques with (right) and without (left) hard constraint.

Table 1: Localization rates (in %) at different values of the localization criterion.

        Without hard constraint     With hard constraint
Δ       Haar    ASEF    PSEF        Haar    ASEF    PSEF
0.10    44.7    66.1    83.0        88.3    91.4    93.3
0.15    47.2    67.8    84.7        91.3    94.4    95.8
0.20    47.5    68.6    85.5        91.7    96.5    97.5
0.25    47.7    69.1    86.0        91.8    98.1    98.6

In the third series of experiments we measured the execution times needed for the localization procedure. The best average time, computed by conducting the (left and right eye) localization procedure 10 times on all test images, was 46.3 ms for the Haar classifier (25.1 ms with the hard constraint) and 1.00 ms for the correlation filters (1.01 ms with the hard constraint). As a final note let us say that the ASEF filters require only a few minutes to be trained, since they rely only on a simple averaging operation. The PSEF filters require a few hours for their training, as this involves the computation of a large correlation matrix and its decomposition. Finally, the Haar classifier is known to have training times in the order of days or weeks.

5 Conclusion
We have presented a new class of correlation filters called Principal directions of Synthetic Exact Filters and applied them to the task of eye localization. We have shown that the filters outperform the recently proposed ASEF filters and the established Haar cascade classifier at this task.

References
[1] B.V.K.V. Kumar, A. Mahalanobis, A. Takessian: Optimal tradeoff circular harmonic function correlation filter methods providing controlled in-plane rotation response. IEEE Trans. on Image Proc., vol. 9, no. 6, 1025-1034, 2000.
[2] V. Struc, J. Zganec-Gros, N. Pavesic: Principal Directions of Synthetic Exact Filters for Robust Real-Time Eye Localization. In: Proc. of BioID, pp. 180-192, 2011.
[3] R.A. Kerekes, B.V.K.V. Kumar: Correlation filters with controlled scale response. IEEE Transactions on Image Processing, vol. 15, no. 7, 1794-1802, 2006.
[4] C.F. Hester, D. Casasent: Multivariant technique for multi-class pat. rec. App. Opt., vol. 19, no. 11, 1758-1761, 1980.
[5] A. Mahalanobis, B.V.K.V. Kumar, D. Casasent: Minimum average correlation energy filters. Applied Optics, vol. 26, no. 17, 3633-3640, 1987.
[6] D.S. Bolme, B.A. Draper, J.R.
Beveridge: Average of synthetic exact filters. In: CVPR'09, pp. 2105-2112, 2009.
[7] P. Viola, M.J. Jones: Robust real-time face detection. Int. J. of Comp. Vis., vol. 57, 137-154, 2004.
[8] P.J. Phillips, H. Moon, S.A. Rizvi, P.J. Rauss: The FERET evaluation methodology for face-recognition algorithms. IEEE TPAMI, vol. 22, no. 10, 1090-1104, 2000.
[9] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller: Labeled Faces in the Wild. Technical Report 07-49, 2007.
[10] O. Jesorsky, K.J. Kirchberg, R.W. Frischholz: Robust face detection using the Hausdorff distance. AVBPA'01, Springer LNCS-2091, pp. 90-95, 2001.