ERK'2020, Portorož, 351-354
Semantic Face Editing with GAN Latent Code Optimization
Martin Pernuš, Vitomir Štruc, Simon Dobrišek
Faculty of Electrical Engineering, University of Ljubljana
E-mail: {martin.pernus, simon.dobrisek}@fe.uni-lj.si
Abstract
Recent advances in generative modeling have reached new heights
in the task of image generation. The generative model that
produces the most photorealistic and high-resolution images is
the Generative Adversarial Network (GAN). In this paper we
propose a multi-task latent vector optimization procedure that
combines local GAN inversion with face attribute constraints for
the task of semantic face editing. The experiments show that the
method produces visually pleasing face images that are
semantically consistent with the selected attributes.
1 Introduction
With the advancements in generative modelling with Generative
Adversarial Networks (GANs) [5], image generation has achieved
unprecedented photorealism. The latent space of GAN models
provides a compressed representation of high-resolution images,
which can be useful for semantic image manipulation. By
manipulating the latent codes of GAN-generated images, several
useful image editing applications have been proposed [1, 2, 19].
Advances in this field could have an important impact on
automatic image editing tasks: such algorithms could enable image
manipulation by simply specifying the desired visual attributes,
which would be valuable in the art, media and entertainment
industries.
The process of obtaining the latent code of an image and
performing various operations on it has been popularized by the
variational autoencoder [12]. Research showed that combining
latent codes does not simply correspond to pixelwise averaging of
images; instead, the results follow a learned image manifold that
reflects the intrinsic data distribution. Since the GAN
architecture lacks a direct way to obtain an image's latent code,
a GAN inversion optimization method [1, 2] is usually applied.
Generative modeling and image editing research has largely
focused on faces, since generative modeling relies on vast
quantities of training examples. For example, the
state-of-the-art StyleGAN model [9] is trained on 70,000
high-resolution face images. Several studies used the pretrained
StyleGAN model to edit face images. Our work is inspired by [19],
where the latent code vector space of GANs was analyzed and
linear movements in that vector space were proposed to edit face
images. Instead of analyzing the latent space, we directly
perform gradient-based optimization on the latent code with
spatial and semantic constraints.
Although studies have analyzed latent code manipu-
lation for various semantic image operations, disentan-
glement of semantic variations of latent codes remains
challenging. In our work, we propose to use a local GAN
inversion technique to preserve the person’s identity while
modifying the selected face attributes. We test the pro-
posed method and visually analyze the results in compar-
ison with a competing method.
2 Related Work
Generative Adversarial Networks (GANs) are among the most widely
used generative models for image modelling tasks. Since the
original proposal in [5], the majority of advances have stemmed
from better architectures and loss functions. The architecture
design was improved in [18], which proposed a convolutional GAN
design. Karras et al. [10] first managed to produce megapixel
images using progressive learning of GANs. The model was further
improved in StyleGAN [8, 9], whose architecture is inspired by
style transfer architectures, resulting in state-of-the-art
unconditional image generation. Loss functions beyond the one
proposed in the vanilla GAN model were explored in
[6, 14, 16, 17].
Because training GAN models is resource-intensive, recent
research focuses on the analysis of pretrained GAN models. In
[3], the pretrained parametric space of GAN weights was manually
tuned to achieve localized deletion and addition of objects in
the output image. In [7], linear and non-linear walks in latent
space were learned that achieved some basic image manipulations,
such as brightness and zoom changes. In [19], linear subspaces of
latent facial semantics were identified to edit face images.
These methods are based on GAN-generated images. GAN models are
inherently constrained by the probability distribution of the
data they are trained on, which limits the usability of latent
codes for image editing. Abdal et al. [1, 2] projected face
images into an extended latent space using a GAN inversion
technique before editing the face images in various ways. In our
work, we also use a version of GAN inversion with an extended
latent space, but we additionally consider various spatial and
semantic constraints.
3 Methods
Our method is based on the StyleGAN model [9], the current
state-of-the-art GAN method for unconditional image generation.
We use the pretrained StyleGAN model trained on the
Flickr-Faces-HQ dataset [8].
StyleGAN's generator is defined by two main components: a
non-linear mapping of the initial latent code and the generator
$G$ that generates the final image. The non-linear latent code
mapping is defined as $f : Z \to W$ and maps the initial
Gaussian-sampled latent code $z$ to a code $w$ of the same
dimensionality (512). The purpose of the mapping $f$ is to
disentangle the factors of variation, which may be entangled in
the $Z$ vector space due to its Gaussian distribution. The
generator $G$ maps the latent space $W$ to images. The $w$ codes
are mapped by 18 learned affine transforms and serve as an input
to the convolutional layers of the generator. Each convolutional
layer also receives a stochastic component in the form of spatial
noise $n \in N_S$ that slightly affects the output face image.
See [9] for more details about the model.
The goal of GAN inversion is to retrieve the latent code $w$, and
optionally the noise $n$, that best matches the image $I$ given
the pretrained generator $G$. As shown in [2], a wide variety of
images, including non-face images, can be embedded in the
extended latent space $W^+$. The extended latent space $W^+$
consists of a concatenation of 18 different 512-dimensional $w$
vectors. We also optimize the noise component $n$ of StyleGAN.
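As a shape-level illustration of the difference between a code in $W$ and a code in $W^+$, the following minimal sketch builds an 18 x 512 extended code from a single 512-dimensional vector. It is not StyleGAN code: the helper names `sample_z` and `broadcast_w` are hypothetical, and the sampled vector only stands in for the output of the learned mapping $f$.

```python
import random

LATENT_DIM = 512   # dimensionality of z and w, as stated in the text
NUM_LAYERS = 18    # number of per-layer w vectors that make up a W+ code

def sample_z(rng):
    """Draw a Gaussian latent code of length LATENT_DIM."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

def broadcast_w(w):
    """Lift a single w in W to W+ by copying it once per layer.

    Every row is an independent list, so the 18 rows can later be
    optimized separately; this extra freedom is what W+ adds over W.
    """
    return [list(w) for _ in range(NUM_LAYERS)]

rng = random.Random(0)
w = sample_z(rng)        # stand-in for f(z); the real f is a learned network
w_plus = broadcast_w(w)
w_plus[3][0] += 1.0      # changing one layer's code leaves the others untouched
```

In $W$ all 18 rows are tied to the same vector, whereas in $W^+$ each row is a free variable of the optimization.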
To achieve the presence of the selected face attribute in the
final image, we also introduce a pretrained classifier $C$ that
predicts the presence of the selected facial attribute. The
predicted probability of the selected attribute is denoted as
$\hat{y}(w, n) = C(G(w, n))$.
Starting from a suitable initialization of $w$ and $n$, we search
for their optimized versions $w^*$ and $n^*$. We only optimize a
subset of the noise $n$, namely the portion that affects the face
region of the image. Our loss function is defined as
$$L(w, n) = \lambda_{mse} \, \| M \odot (G(w, n) - I) \|_2^2 + \lambda_{ce} \left( -y \log \hat{y}(w, n) - (1 - y) \log(1 - \hat{y}(w, n)) \right) \tag{1}$$
where $M$ is the spatial mask that defines the region over which
the mean squared error part of the loss is computed, $I$ is the
target image, $y \in \{0, 1\}$ is the desired label of the
selected attribute, $\lambda_{mse}$ is the weighting constant for
the mean squared error loss and $\lambda_{ce}$ is the weighting
constant for the binary cross-entropy loss.
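To make the two terms of Eq. (1) concrete, the sketch below evaluates the loss for flattened images stored as plain lists. It assumes the mask is applied elementwise and follows the equation literally, so it uses the squared L2 norm (a sum, not an average). The function name `edit_loss` and the `eps` safeguard are hypothetical additions; the default weights are the values reported in Section 4.1.

```python
import math

def edit_loss(generated, target, mask, y_hat, y,
              lambda_mse=100.0, lambda_ce=10.0, eps=1e-12):
    """Evaluate Eq. (1) for flattened images.

    generated, target, mask: equal-length lists of floats (pixel values).
    y_hat: classifier probability of the selected attribute, in (0, 1).
    y: desired attribute label, 0 or 1.
    """
    # Masked squared L2 norm: || M * (G(w, n) - I) ||_2^2
    masked_sq = sum((m * (g - t)) ** 2
                    for g, t, m in zip(generated, target, mask))
    # Binary cross-entropy; eps guards against log(0) (an added safeguard).
    bce = -(y * math.log(y_hat + eps)
            + (1 - y) * math.log(1.0 - y_hat + eps))
    return lambda_mse * masked_sq + lambda_ce * bce

# Two-pixel toy example: the second pixel differs from the target but lies
# outside the mask, so only the attribute term contributes to the loss.
loss = edit_loss([1.0, 2.0], [1.0, 0.0], [1.0, 0.0], y_hat=0.5, y=1)
```

Pixels where the mask is zero are free to change, which is what lets the attribute term reshape the edited region while the masked region stays close to the target image.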
Special consideration must be given to the spatial mask $M$. It
must be large enough to preserve enough of the face to retain the
face identity, while also allowing the rest of the image to
change enough to satisfy the selected face attribute constraint.
We define $M$ with a face segmentation model that is based on the
DeepLabv3 segmentation model [4]. We trained the model to
spatially predict the face components of interest in the image
$I$ based on the selected facial attribute. We then binarize the
prediction with a predefined threshold and blur it with a
Gaussian filter to allow smooth changes to the face image.
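The binarize-then-blur step can be sketched in plain Python as follows. This is an illustration under assumptions rather than the implementation used in the paper: the function names are hypothetical, the blur is a separable Gaussian with replicated borders, and the kernel size is assumed to be odd.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a normalized 1-D Gaussian kernel (size is assumed odd)."""
    half = size // 2
    taps = [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-half, half + 1)]
    total = sum(taps)
    return [t / total for t in taps]

def blur_1d(row, kernel):
    """Convolve one row with the kernel, replicating the border values."""
    half = len(kernel) // 2
    last = len(row) - 1
    return [sum(k * row[min(max(x + j - half, 0), last)]
                for j, k in enumerate(kernel))
            for x in range(len(row))]

def soft_mask(prob_map, threshold, size, sigma):
    """Binarize a 2-D probability map, then blur it (rows, then columns)."""
    binary = [[1.0 if p > threshold else 0.0 for p in row] for row in prob_map]
    kernel = gaussian_kernel(size, sigma)
    rows = [blur_1d(row, kernel) for row in binary]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]   # transpose back to row-major

# Toy 5x5 prediction with a single confident pixel in the centre.
prob = [[0.0] * 5 for _ in range(5)]
prob[2][2] = 0.95
mask = soft_mask(prob, threshold=0.9, size=3, sigma=1.0)
```

The blurred mask takes values between 0 and 1, so the reconstruction constraint fades out gradually at the border of the preserved region instead of ending at a hard edge.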
4 Experiments
4.1 Implementation
The classifier $C$ is based on the state-of-the-art multi-task
neural architecture [20] and is trained on the CelebA dataset
[15] for 23 epochs. The starting learning rate is 0.05 and it
decays to one tenth of its current value every 40,000 steps. The
CelebA dataset contains 200,000 celebrity images with 40
annotated facial attributes per image as well as their identity
information.
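The classifier's learning rate schedule described above is a staircase decay. A minimal sketch, with the hypothetical function name `step_decay_lr` and the values from the text as defaults:

```python
def step_decay_lr(step, base_lr=0.05, decay_every=40_000, factor=0.1):
    """Learning rate after `step` updates with staircase decay.

    The rate is multiplied by `factor` once every `decay_every` steps.
    """
    return base_lr * factor ** (step // decay_every)

lr_start = step_decay_lr(0)            # before the first decay
lr_after_one = step_decay_lr(40_000)   # after one decay
lr_after_two = step_decay_lr(80_000)   # after two decays
```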
The face segmentation model is a DeepLabv3 model [4] trained on
the CelebAMask-HQ dataset [13]. This dataset contains 30,000
images of size $512 \times 512$ and 19 classes of facial
components and accessories such as skin, nose, eyes, eyebrows,
ears, mouth, lip, hair, hat, eyeglass, earring, necklace, neck,
and cloth. Each image is annotated with a segmentation mask of
facial components. After grouping several semantically similar
attributes, we ended up with the 'background', 'mouth',
'eyebrows', 'eyes', 'earrings', 'hair', 'nose' and 'skin' facial
components that we use to train the face segmentation model. The
model is trained for 5 epochs with a learning rate of
$3 \times 10^{-4}$.
The latent code $w$ is initialized with the mean latent code,
calculated by passing 10,000 Gaussian-sampled $z$ vectors through
StyleGAN's feed-forward mapping network and averaging the
outputs. The optimization of $w^*$ and $n^*$ is run for 2000
iterations. The learning rate is first linearly increased from 0
to $10^{-2}$, then decayed back to 0 using a cosine schedule
during the last 500 iterations. The threshold for binarizing the
mask prediction is set to 0.9 and the size and the standard
deviation of the Gaussian filter are set to 51 and 31,
respectively. $\lambda_{mse}$ is set to 100 and $\lambda_{ce}$ is
set to 10.
All models are optimized with the Adam optimization method [11].
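The overall procedure, gradient-based minimization of Eq. (1) with Adam, can be illustrated on a toy problem with analytic gradients. Everything in this sketch is a stand-in: the "generator" is the identity on a two-element vector, the "classifier" is a fixed logistic model, the step count is arbitrary, and the moment coefficients are the commonly used Adam defaults rather than values from the text. In the actual method the gradients would come from backpropagation through StyleGAN and the classifier $C$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_loss_and_grad(w, target, mask, clf_weights, y,
                      lambda_mse=100.0, lambda_ce=10.0, eps=1e-12):
    """Loss of Eq. (1) and its gradient for a toy setup.

    Toy assumptions: G(w) = w (identity "generator") and
    C(x) = sigmoid(clf_weights . x) (fixed logistic "classifier").
    """
    y_hat = sigmoid(sum(c * x for c, x in zip(clf_weights, w)))
    sq = sum((m * (x - t)) ** 2 for x, t, m in zip(w, target, mask))
    bce = -(y * math.log(y_hat + eps)
            + (1 - y) * math.log(1.0 - y_hat + eps))
    loss = lambda_mse * sq + lambda_ce * bce
    # d(sq)/dw_i = 2 m_i^2 (w_i - t_i); d(bce)/dw_i = (y_hat - y) c_i
    grad = [lambda_mse * 2.0 * m * m * (x - t) + lambda_ce * (y_hat - y) * c
            for x, t, m, c in zip(w, target, mask, clf_weights)]
    return loss, grad

def adam_optimize(w, grad_fn, steps=500, lr=1e-2,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam [11] on a list of floats; returns an optimized copy."""
    w = list(w)
    m1 = [0.0] * len(w)   # running mean of gradients
    m2 = [0.0] * len(w)   # running mean of squared gradients
    for t in range(1, steps + 1):
        _, g = grad_fn(w)
        for i, gi in enumerate(g):
            m1[i] = beta1 * m1[i] + (1 - beta1) * gi
            m2[i] = beta2 * m2[i] + (1 - beta2) * gi * gi
            m_hat = m1[i] / (1 - beta1 ** t)   # bias correction
            v_hat = m2[i] / (1 - beta2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

target = [0.0, 0.0]   # toy "image" with two pixels
mask = [1.0, 0.0]     # only the first pixel must stay close to the target
clf = [0.0, 1.0]      # the attribute depends on the second pixel only
fn = lambda w: toy_loss_and_grad(w, target, mask, clf, y=1)
w_star = adam_optimize([0.5, 0.0], fn)
```

In this toy setting the masked coordinate is pulled back to the target while the unmasked coordinate moves freely to raise the attribute probability, which mirrors the intended behaviour of the local inversion.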
4.2 Evaluation
For comparison with existing face editing techniques, we
implemented the InterFace method [19]. The main idea behind the
InterFace method is to find, with a support vector machine, a
hyperplane that optimally separates the latent space with respect
to an attribute, and then to move the latent code in the selected
direction. It operates in the $W$ space, which guarantees that
the latent code will generate a face image. However, the $W$
space is often insufficient for GAN inversion, so the
reconstructions do not match the original image as well as they
would in the $W^+$ space.
To find the hyperplane of each facial attribute we followed the
InterFace procedure. First we generated 500,000 StyleGAN images.
We picked the 10,000 most positive and 10,000 most negative
images per facial attribute according to our classifier $C$. The
hyperplane was then identified using a linear support vector
machine on the latent codes of these images. For visualization we
move the latent codes by 0.5 times the norm of the latent vector,
as opposed to 3 times the norm in the original proposal, since in
our experiments the face images failed to preserve any identity
information in the latter case.
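The linear edit used for the comparison can be sketched as follows. It reads "0.5 times the norm of the latent vector" as a displacement of length $0.5 \|w\|$ along the unit normal of the attribute hyperplane, which is one plausible interpretation of the text. The function names are hypothetical, and fitting the hyperplane itself (the linear SVM) is not shown.

```python
import math

def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def move_latent(w, normal, scale=0.5):
    """Shift w along the unit hyperplane normal by scale * ||w||."""
    n_hat = unit(normal)
    step = scale * math.sqrt(sum(x * x for x in w))
    return [x + step * n for x, n in zip(w, n_hat)]

# 2-D toy example: ||w|| = 5, so the code moves 2.5 units along the normal.
w_edit = move_latent([3.0, 4.0], [0.0, 2.0])
```

Using a negative `scale` moves the code to the other side of the hyperplane, i.e. towards the absence of the attribute.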
In Figure 1 we show the results of the compared InterFace method
and our proposed method for several facial attributes. The
results visually indicate that our proposed method retains the
identity information better than the InterFace method.
The advantages of our method over the InterFace method can be
summarized as better visual results and no pretraining
requirement. However, once InterFace has learned the optimal
hyperplane, it allows immediate calculation of new latent vectors
(and new face images), while our method requires an optimization
of the latent code. Our optimization procedure takes
approximately 10 minutes.
5 Conclusion and future work
In this paper, a method for changing facial attributes while
preserving identity is proposed. The method is based on a local
GAN inversion technique with facial attribute constraints.
Experimental results visually suggest that real face images can
be modified according to selected attributes while preserving
their identity.
Additional improvements could be made using disentanglement
techniques, which could modify the latent space in such a way
that changing a single facial attribute would not affect other
attributes, while allowing the pose of the person to change. This
is the topic of our further research.
Acknowledgments
This work was supported in part by the ARRS Research
Programmes P2-0250 (B) Metrology and Biometric Sys-
tems, the ARRS young researcher program and the ARRS
research project J2-2501 - Deep generative models for
beauty and fashion (DeepBeauty).
References
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. “Im-
age2StyleGAN++: How to Edit the Embedded Im-
ages?” In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition.
2020, pp. 8296–8305.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. “Im-
age2stylegan: How to embed images into the style-
gan latent space?” In: Proceedings of the IEEE in-
ternational conference on computer vision. 2019,
pp. 4432–4441.
[3] David Bau et al. “GAN Dissection: Visualizing and
Understanding Generative Adversarial Networks”.
In: International Conference on Learning Repre-
sentations. 2018.
[4] Liang-Chieh Chen et al. “Rethinking atrous convo-
lution for semantic image segmentation”. In: arXiv
preprint arXiv:1706.05587 (2017).
[5] Ian Goodfellow et al. “Generative adversarial nets”.
In: Advances in neural information processing sys-
tems. 2014, pp. 2672–2680.
[6] Ishaan Gulrajani et al. “Improved training of wasser-
stein gans”. In: Advances in neural information pro-
cessing systems. 2017, pp. 5767–5777.
[7] Ali Jahanian, Lucy Chai, and Phillip Isola. “On
the ‘steerability’ of generative adversarial networks”.
In: International Conference on Learning Representations. 2019.
[8] Tero Karras, Samuli Laine, and Timo Aila. “A style-
based generator architecture for generative adver-
sarial networks”. In: Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recogni-
tion. 2019, pp. 4401–4410.
[9] Tero Karras et al. “Analyzing and improving the
image quality of stylegan”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition. 2020, pp. 8110–8119.
[10] Tero Karras et al. “Progressive Growing of GANs
for Improved Quality, Stability, and Variation”. In:
International Conference on Learning Representa-
tions. 2018.
[11] Diederik P Kingma and Jimmy Ba. “Adam: A method
for stochastic optimization”. In: arXiv preprint
arXiv:1412.6980 (2014).
[12] Diederik P Kingma and Max Welling. “Auto-encoding
variational bayes”. In: arXiv preprint
arXiv:1312.6114 (2013).
[13] Cheng-Han Lee et al. “MaskGAN: Towards diverse
and interactive facial image manipulation”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2020, pp. 5549–5558.
[14] Jae Hyun Lim and Jong Chul Ye. “Geometric gan”.
In: arXiv preprint arXiv:1705.02894 (2017).
[15] Ziwei Liu et al. “Deep Learning Face Attributes in
the Wild”. In: Proceedings of International Con-
ference on Computer Vision (ICCV). Dec. 2015.
[16] Xudong Mao et al. “Least squares generative ad-
versarial networks”. In: Proceedings of the IEEE
International Conference on Computer Vision. 2017,
pp. 2794–2802.
[17] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka.
“f-gan: Training generative neural samplers using
variational divergence minimization”. In: Advances
in neural information processing systems. 2016,
pp. 271–279.
[18] A. Radford, L. Metz, and S. Chintala. “Unsuper-
vised representation learning with deep convolu-
tional generative adversarial networks”. In: arXiv
preprint arXiv:1511.06434 (2015).
[19] Yujun Shen et al. “Interpreting the latent space of
gans for semantic face editing”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2020, pp. 9243–9252.
[20] Simon Vandenhende et al. “Branched multi-task
networks: deciding what layers to share”. In: arXiv
preprint arXiv:1904.02920 (2019).
Figure 1: Comparison of InterFace [19] performance (first row) to our method (second row) on several face editing tasks for an
example face image. The selected facial attributes from left to right are defined as follows: original image, arched eyebrows, big
nose, bushy eyebrows, narrow eyes, smiling, wearing earrings and wearing lipstick.
Figure 2: Comparison of InterFace [19] performance (first row) to our method (second row) on several face editing tasks for an
example face image. The selected facial attributes from left to right are defined as follows: original image, blond hair, brown hair,
gray hair, straight hair, wavy hair.