ERK'2021, Portorož, 352-355

Frequency Band Encoding for Face Super-Resolution

Klemen Grm¹, Vitomir Štruc¹
¹University of Ljubljana, Faculty of Electrical Engineering
E-mail: klemen.grm@fe.uni-lj.si

Abstract

In this paper, we present a novel method for face super-resolution based on an encoder-decoder architecture. Unlike previous approaches, which focused primarily on directly reconstructing the high-resolution face appearance from low-resolution images, our method relies on a multi-stage approach in which we learn a face representation in different frequency bands and then decode the representation into a high-resolution image. Using quantitative experiments, we demonstrate that this approach results in better face image reconstruction and aids downstream semantic tasks such as face recognition and face verification.

1 Introduction

Face super-resolution is a subset of the general single-image super-resolution problem. Single-image super-resolution is the inverse problem of recovering a high-resolution image given a low-resolution (e.g., sub-sampled) version of it. The problem is highly ill-posed, since a single low-resolution image can correspond to many different high-resolution images: for a given sub-sampling factor $f$, the ratio of information from the high-resolution image retained in the low-resolution image is at most $1/f^2$, even given perfect sampling and anti-aliasing strategies. Unlike the general single-image super-resolution problem, however, its domain-specific subsets such as face super-resolution allow us to use constraints on the high-resolution image appearance, such as face identity information [2] or face component segmentation [8], to constrain the solution space and achieve better high-resolution reconstruction at higher magnification ratios than are typically considered in general single-image super-resolution applications.
Existing work on super-resolution methods based on machine learning techniques (e.g., [3, 12, 6, 7, 13]) typically generates pairs of aligned low-resolution and high-resolution images, $x$ and $y$ respectively, by starting with a high-resolution image dataset and downsampling the images through the process

$$x = (Hy)\downarrow_d + N, \tag{1}$$

where $H$ is an image filtering operator (e.g., a Gaussian or Lanczos filter), $\downarrow$ is the sub-sampling operator, $d$ is the sub-sampling factor, and $N$ is a noise component. Given the generated dataset of $(x, y)$ pairs, a differentiable model $m_\theta$ with free parameters $\theta$ is then trained to approximate the inverse process, i.e., to predict an approximation $\hat{y}$ of the high-resolution image given a low-resolution image $x$ as input,

$$\hat{y} = m_\theta(x), \tag{2}$$

by setting the model parameters $\theta$ through gradient descent on a loss function, typically a distortion measure between the model predictions and the ground-truth high-resolution images, e.g.,

$$L(x, y) = \| m_\theta(x) - y \|_p^p, \tag{3}$$

where $p$ is the order of the norm used to measure the distortion between the model approximations and the expected (i.e., ground-truth) high-resolution images.

Figure 1: Output samples of the proposed face super-resolution method. The figure depicts the input low-resolution images (a), interpolated to the output resolution using bicubic interpolation (b), compared against the results of the proposed method (c) and the high-resolution ground truth (d).

Figure 2: The architecture of the HR encoder and universal decoder segments of our model (HR image, 192×192×3 → HR encoder → decoder → HR image, 192×192×3). The HR encoder produces the high- and low-frequency components of the face representation, where the high-frequency component is derived from the earlier layers of the network with smaller receptive fields, whereas the low-frequency component of the representation is derived from the final output of the convolutional backbone.
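To make Eqs. (1)-(3) concrete, the following toy sketch generates one $(x, y)$ pair and fits a one-parameter model by gradient descent with $p = 2$. Everything in it is an illustrative assumption rather than part of the method: the box filter standing in for $H$, the 16×16 random "image", the scalar model `theta * upsample(x)`, and all function names.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # sub-sampling factor
y = rng.random((16, 16))                 # toy stand-in for a high-resolution image

# Eq. (1) with H = d x d box filter, sub-sampling by d, and small Gaussian noise N.
x = y.reshape(16 // d, d, 16 // d, d).mean(axis=(1, 3))
x = x + rng.normal(0.0, 0.01, size=x.shape)

def upsample(img, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def model(img, theta):
    """Hypothetical one-parameter model m_theta (Eq. 2)."""
    return theta * upsample(img, d)

def loss(theta, p=2):
    """Distortion of Eq. (3): sum over pixels of |m_theta(x) - y|^p."""
    return float(np.sum(np.abs(model(x, theta) - y) ** p))

# Plain gradient descent on theta for p = 2.
u = upsample(x, d)
theta, lr, losses = 0.0, 1e-3, []
for _ in range(200):
    grad = 2.0 * np.sum((theta * u - y) * u)   # d/dtheta of sum((theta*u - y)^2)
    theta -= lr * grad
    losses.append(loss(theta))
```

A real super-resolution model replaces the scalar `theta` with the weights of a deep network and the hand-written gradient with automatic differentiation, but the structure of the optimisation is the same.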
Figure 3: The architecture of the low-resolution encoder segment of our model (LR image, 24×24×3 → LF encoder and HF encoder → concatenation). The LR encoder produces the high- and low-frequency components of the face representation from an input LR image by using two sub-models in a cascaded residual setup.

Unlike previous approaches, our proposed method separates the training of the upsampling model, $m_\theta$, into two steps, namely, i) learning an appropriate representation of face images for high-resolution reconstruction, and ii) performing the high-resolution reconstruction given the representation. This approach has several advantages over the end-to-end approach. Firstly, the representation can be derived using only high-resolution images and autoencoder training, without involving a synthetic downsampling process such as (1). Secondly, our chosen encoder architecture explicitly splits the representation of the high- and low-frequency bands of the input image, allowing both to be retained in the representation, which in turn allows the decoder to learn sharper reconstructions. Thirdly, given a trained decoder that produces high-resolution images from a face image representation, we could in principle train different encoders for images of different resolution or quality levels, as opposed to a universal model, allowing more fine-tuned results without needing different reconstruction models.

To summarize, the key contributions of this paper are:

1. We propose a novel encoder-decoder based architecture for the face super-resolution problem.
2. We present a training process that allows our model to learn face image representations with an explicit separation of the information content from the high- and low-frequency bands of the input image.
3. We evaluate the proposed method on the standard image reconstruction task using distortion metrics, as well as on downstream semantic tasks, to demonstrate the effectiveness of our approach.
2 Methodology

2.1 Datasets

We use the VGGFace2 [1] dataset to train our proposed method. We chose this dataset because of its relatively large size (3.31 million images of 9131 subjects) compared to datasets commonly used for face super-resolution training, such as CASIA WebFace [15] and CelebFaces Attributes (CelebA) [9], and because its images are of higher overall quality in terms of sharpness. A large dataset of diverse, high-quality images is needed to make the best use of large model capacity. We use the VGGFace2 test set (whose subjects do not overlap with those of the training set) to evaluate the reconstruction capability of our model. Furthermore, we use the LFW dataset [5] to evaluate the utility of our proposed method for downstream semantic tasks, i.e., face recognition.

In order to downsample the high-resolution images (to generate the corresponding low-resolution inputs), we perform the following steps:

1. Resize the image to 256 px along the short side, maintaining the aspect ratio of the original.
2. Extract a random 192 px square crop from the image.
3. Flip the image horizontally with probability 0.5 (at this step, the image represents the HR part of the $(x, y)$ pair).
4. Perform Gaussian filtering on the HR image, with $\sigma \sim U(3, 4)$ and kernel size $k = \lceil 4\sigma \rceil + 1$.
5. Downsample the filtered image by a factor of 8 using bicubic resampling.
6. Desaturate the image contrast by a random factor $f \sim U(0.5, 1)$.
7. Add Gaussian white noise with $\sigma \sim U(0, \frac{10}{255})$ to each color channel of the low-resolution image.

2.2 Model architecture

The overall architecture of the proposed model is illustrated in Figures 2 and 3. The model comprises three main components, namely:

The HR encoder, which produces a band-separated face representation $z_{LF} \| z_{HF}$ given a high-resolution input face image.
The CNN backbone is based on the VGGFace [11] architecture, which has proven to be more useful for perceptual tasks and style transfer than newer image recognition models based on deep residual learning [10]. The high-frequency band of the representation is extracted from the earlier layers of the convolutional backbone, where the network has a limited receptive field and therefore lacks the capability to learn the low-frequency global structure of the image. In contrast, the low-frequency band of the face representation is derived from the final output of the convolutional backbone.

The universal decoder, which is trained to reconstruct a high-resolution image given the corresponding face representation. To avoid having to encode spatial information into latent vectors, which the decoder would then have to interpret, we retain the spatial relations of the entries in the face representation by keeping it in its tensor form, i.e., given an input image $I_{HR} \in \mathbb{R}^{192 \times 192 \times 3}$, the face representations satisfy $z_{HF} \in \mathbb{R}^{24 \times 24 \times 8}$ and $z_{LF} \in \mathbb{R}^{24 \times 24 \times 8}$. We decode the face representation into a high-resolution image using a modified EDSR [7] network in which the first layer is changed to accept the 8-channel low-frequency band of the representation, whereas the high-frequency band of the representation is passed directly to a subpixel convolution [12] upsampling module, whose output is added to the final result of the decoder.

The LR encoder, which is trained to produce a face representation $\hat{z}_{LF} \| \hat{z}_{HF}$ given an input low-resolution image. The encoder consists of two sub-models, which produce the respective frequency bands of the encoding. Both share the same architecture, i.e., an EDSR-like [7] convolutional backbone without the upsampling module, since the spatial resolution of the face representation is already equal to the spatial resolution of the input low-resolution images.
The low-frequency band of the approximated representation is derived directly from the LF encoder. Then, $\hat{z}_{LF}$ is concatenated to the input low-resolution image along the channel axis and used as an input to the HF encoder. This allows the HF encoder to use information about which parts of the image are already encoded in the low-frequency band of the representation and to focus on refining the output image. The final approximation of the representation, $\hat{z}_{LF} \| \hat{z}_{HF}$, can then be used as an input to the universal decoder.

2.3 Training procedure

We begin training our model with autoencoder training of the HR encoder and universal decoder, as illustrated in Figure 2. This combined model is trained to reconstruct high-resolution images presented to the HR encoder as input. We begin by training only the low-frequency band of the representation. Once the training converges, the high-frequency band path is activated to further refine the results. In Figure 4, we show an image reconstructed using the HR encoder and universal decoder. Using this architecture, we are able to achieve near-perfect reconstructions of high-resolution images (SSIM > 0.99), which means the decoder is well suited to the face super-resolution task.

Figure 4: A reconstructed high-resolution image using our HR encoder and universal decoder. The figure depicts (a) the output of the decoder given only the low-frequency band of the face representation, (b) the full output of the decoder, (c) the ground-truth high-resolution image, and (d) the difference between the outputs from the low-frequency-only and full representations.

We use the $L_1$ loss to train the HR encoder and universal decoder with gradient descent, i.e.,

$$L(y, \hat{y}) = \mathbb{E}_{y \sim T} |y - \hat{y}| = \mathbb{E}_{y \sim T} \left| y - m_{DEC}\left(m_{ENC}^{HR}(y)\right) \right|, \tag{4}$$

where the image $y$ is considered as a sample of the training set $T$.
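As a shape-level check of the data flow described in Section 2.2, the sketch below traces a 24×24×3 input through the cascaded LR encoder and a two-path decoder. It is only a sketch under stated assumptions: the random 1×1 convolutions (`conv1x1`) are placeholders for the EDSR-like backbones, the channel ordering inside `pixel_shuffle` is one possible convention, and the concatenation of the two bands along the channel axis is assumed from the "CAT" block in Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_channels):
    """Placeholder layer: a random 1x1 convolution (per-pixel linear map)."""
    w = rng.normal(0.0, 0.1, size=(x.shape[-1], out_channels))
    return x @ w

def pixel_shuffle(x, r):
    """Depth-to-space rearrangement: (H, W, C*r*r) -> (H*r, W*r, C)."""
    h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(h, w, r, r, c_out)
    x = x.transpose(0, 2, 1, 3, 4)          # (H, r, W, r, C)
    return x.reshape(h * r, w * r, c_out)

x_lr = rng.random((24, 24, 3))               # low-resolution input image

# Cascaded LR encoder: the HF encoder sees the image plus the LF band (3 + 8 channels).
z_lf = conv1x1(x_lr, 8)
hf_input = np.concatenate([x_lr, z_lf], axis=-1)
z_hf = conv1x1(hf_input, 8)
z = np.concatenate([z_lf, z_hf], axis=-1)    # full representation (assumed channel-wise)

# Decoder: main path on the LF band, subpixel-convolution path on the HF band, summed.
scale = 8
main_path = pixel_shuffle(conv1x1(z_lf, 3 * scale * scale), scale)
hf_path = pixel_shuffle(conv1x1(z_hf, 3 * scale * scale), scale)
y_hat = main_path + hf_path

# Tiny check of the rearrangement: 4 channels at one pixel become a 2x2 patch.
patch = pixel_shuffle(np.arange(4.0).reshape(1, 1, 4), 2)[..., 0]
```

The point of the sketch is only that the representation keeps the 24×24 spatial grid of the input, so an 8× depth-to-space step is enough to reach the 192×192 output resolution.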
Once the universal decoder is initialized through high-resolution autoencoder training, we train the low-resolution encoder model to produce approximations of the face representation, $\hat{z}_{LF} \| \hat{z}_{HF}$, from a low-resolution image. As above, the representations are passed to the universal decoder to produce a reconstruction of the high-resolution image. To train the low-resolution encoder, we use a combination of a per-pixel $L_1$ loss on the reconstructed image and a cycle consistency loss obtained by passing the reconstructed image back to the high-resolution encoder. The full loss is

$$L(x, y) = \lambda_1 |y - \hat{y}| + \lambda_2 L_{CYC}(x, y) = \lambda_1 \left| y - m_{DEC}\left(m_{ENC}^{LR}(x)\right) \right| + \lambda_2 L_{CYC}(x, y), \tag{5}$$

where $L_{CYC}$ is the cycle consistency loss, i.e.,

$$L_{CYC}(x, y) = \left| m_{ENC}^{HR}(y) - m_{ENC}^{HR}(\hat{y}) \right| = \left| m_{ENC}^{HR}(y) - m_{ENC}^{HR}\left(m_{DEC}\left(m_{ENC}^{LR}(x)\right)\right) \right|, \tag{6}$$

and $\lambda_1$ and $\lambda_2$ are the weights of the pixel loss and the cycle consistency loss, respectively. Using a logarithmic grid search, we experimentally set $\lambda_1 = 10^{-3}$ and $\lambda_2 = 1$. The model is trained using the Adam [4] gradient descent optimization method, with a learning rate of $10^{-4}$ and a batch size of 64, until convergence.

3 Results

Table 1: Image restoration results on the VGG2 test set. SSIM results closer to 1 indicate better face super-resolution performance.

Method                    SSIM
Bicubic interpolation     0.5921
EDSR [7]                  0.6487
C-SRIP [2]                0.6991
FBE-FSR (proposed)        0.7531

Table 2: Face verification results on the LFW dataset. Results on the original high-resolution images are included for comparison against restored low-resolution images.

Method                    Verification accuracy (mean ± std)
Bicubic interpolation     0.8417 ± 0.0101
EDSR [7]                  0.9137 ± 0.0079
C-SRIP [2]                0.9341 ± 0.0052
FBE-FSR (proposed)        0.9465 ± 0.0059
Original HR images        0.9936 ± 0.0042

We evaluate the high-resolution face reconstruction capabilities of our model on the VGG2 test set, which contains 169396 images of 500 subjects, disjoint from the training set. We generate LR images using the same process outlined in Section 2.1.
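The low-resolution generation process of Section 2.1 can be sketched roughly as follows. This is a minimal NumPy illustration under several assumptions, not the implementation used in the experiments: step 1 (resizing to 256 px) is assumed to have been done already, plain strided subsampling stands in for the bicubic resampling of step 5, the contrast reduction of step 6 is interpreted as shrinking deviations from the mean intensity, the final clipping to [0, 1] is added for convenience, and the function names are hypothetical.

```python
import math
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel with size k = ceil(4 * sigma) + 1 (step 4)."""
    k = math.ceil(4 * sigma) + 1
    t = np.arange(k) - (k - 1) / 2.0
    w = np.exp(-0.5 * (t / sigma) ** 2)
    return w / w.sum()

def blur(img, sigma):
    """Separable Gaussian blur of an (H, W, C) float image with reflect padding."""
    w = gaussian_kernel(sigma)
    lo, hi = (len(w) - 1) // 2, len(w) // 2
    out = img
    for axis in (0, 1):
        pad = [(0, 0)] * 3
        pad[axis] = (lo, hi)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, w, mode="valid"), axis, padded)
    return out

def make_pair(image, rng, crop=192, factor=8):
    """Return an (x, y) pair from a float image in [0, 1], already resized (step 1)."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)            # step 2: random square crop
    left = rng.integers(0, w - crop + 1)
    y = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                         # step 3: horizontal flip
        y = y[:, ::-1]
    x = blur(y, rng.uniform(3.0, 4.0))             # step 4: Gaussian filtering
    x = x[::factor, ::factor]                      # step 5: ASSUMPTION, strided not bicubic
    f = rng.uniform(0.5, 1.0)                      # step 6: ASSUMPTION, contrast about mean
    x = x.mean() + f * (x - x.mean())
    noise_sigma = rng.uniform(0.0, 10.0 / 255.0)   # step 7: white Gaussian noise
    x = x + rng.normal(0.0, noise_sigma, size=x.shape)
    return np.clip(x, 0.0, 1.0), y

rng = np.random.default_rng(0)
image = rng.random((256, 300, 3))                  # stand-in for a resized face image
x, y = make_pair(image, rng)
```

For the LFW experiments described below, the random crop would be replaced by a central one; in this sketch that amounts to fixing `top` and `left` instead of sampling them.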
We measure the distortion between the ground-truth high-resolution images and the outputs of our model using the Structural Similarity Index (SSIM [14]) metric, the literature standard for evaluating reference-based image restoration algorithms. We compare the restoration results against competing algorithms in Table 1. We note that, in comparison to the C-SRIP [2] model, the proposed method is able to perform more convincing face image restoration on edge cases such as extreme pose (profile images), face occlusion, and face scale variation. We show some qualitative comparisons of the model output in Figure 5.

We also evaluate the proposed method in terms of its utility for downstream semantic computer vision tasks, i.e., face recognition. We use the LFW [5] dataset for this experiment, because it is considered a "solved" dataset in the sense that modern face recognition models achieve near 100% verification accuracy, and we are mainly interested in the amount of performance degradation on downsampled and super-resolved images. We downsample the images using an algorithm similar to the one outlined above, except that the 192 px crop is always central, since the face images are already aligned. Next, we perform image restoration on the generated LR images and run the standard LFW face verification experiment using a pretrained VGG2-SENet-101 [1] face feature extraction model. The verification accuracies of the considered methods (in terms of mean and standard deviation over the 10-fold evaluation protocol) are compared in Table 2.

4 Conclusion

We have presented FBE-FSR, a novel method for face image super-resolution based on frequency component separation in the latent space of an encoder-decoder architecture. We have shown that our proposed method is capable of outperforming our previous work and competing methods in terms of image restoration capability and utility for downstream vision tasks.
In terms of future work, we plan on extending the method by adapting additional encoder models for specific image degradation pipelines, and extending the training framework with generative adversarial objectives.

Figure 5: A qualitative comparison of our proposed method with the next best tested method. The figure depicts LR input images (a), results of bicubic interpolation (b), C-SRIP (c), the proposed method (d), and the HR image (e).

References

[1] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018.
[2] K. Grm, W. J. Scheirer, and V. Štruc. Face hallucination using cascaded super-resolution and identity priors. IEEE Transactions on Image Processing, 29:2150–2165, 2019.
[3] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[4] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[5] G. Learned-Miller. Labeled faces in the wild: Updates and new reporting procedures. University of Massachusetts, Amherst, Tech. Rep. UM-CS-2014-003, 2014.
[6] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[7] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
[8] L. Liu, S. Wang, and L. Wan. Component semantic prior guided generative adversarial network for face super-resolution. IEEE Access, 7:77027–77036, 2019.
[9] Z. Liu, P. Luo, X. Wang, and X. Tang. Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August, 15(2018):11, 2018.
[10] A. Mordvintsev, N. Pezzotti, L. Schubert, and C. Olah. Differentiable image parameterizations. Distill, 2018. https://distill.pub/2018/differentiable-parameterizations.
[11] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. 2015.
[12] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[13] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
[14] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[15] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.