Intention

This post carefully distinguishes between what is readily available, what needs to be ported to PyTorch, what needs to be implemented from scratch, and what is unexplored.

There are four papers that can fulfill our task:

| Title | Link | Code Link | Framework |
| --- | --- | --- | --- |
| DeepPrivacy: A Generative Adversarial Network for Face Anonymization | https://arxiv.org/abs/1909.04538 | https://github.com/hukkelas/DeepPrivacy | PyTorch |
| AttGAN: Facial Attribute Editing by Only Changing What You Want | https://arxiv.org/abs/1711.10678 | https://github.com/LynnHo/AttGAN-Tensorflow | TensorFlow |
| StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation | https://arxiv.org/abs/1711.09020 | https://github.com/cosmic119/StarGAN | PyTorch |
| StarGAN v2: Diverse Image Synthesis for Multiple Domains | https://arxiv.org/abs/1912.01865 | https://github.com/clovaai/stargan-v2 | PyTorch |

I will summarize the papers and note what can be learned from each.

What we need to adopt from these papers: metrics and datasets.

DeepPrivacy: A Generative Adversarial Network for Face Anonymization

  • Generator is U-Net
  • High resolution is possible only with Progressive GAN training
  • requires a bounding box annotation of the privacy-sensitive area and a sparse pose estimate of the face, with keypoints for the ears, eyes, nose, and shoulders
  • the authors provide a new dataset, Flickr Diverse Faces (FDF), which satisfies these requirements: www.github.com/hukkelas/FDF
  • they evaluate on the WIDER FACE dataset, http://shuoyang1213.me/WIDERFACE/
  • metric: Average Precision (AP)
  • compared against other anonymization methods: 8x8 pixelation, heavy blur, and black-out
  • Generator: U-Net, same as Progressive GAN
  • Discriminator: same as Progressive GAN, with three changes:
    • background information is given as conditional input at the start of the discriminator, making the input have six channels instead of three (see the sketch after this list)
    • pose information is included at each resolution of the discriminator
    • the mini-batch standard deviation layer is removed
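
To make the six-channel conditioning concrete, here is a minimal PyTorch sketch. The shapes, channel counts, and the `first_conv` layer are illustrative assumptions, not the authors' code; the point is only that the background image is concatenated channel-wise with the generated face before the first discriminator convolution:

```python
import torch
import torch.nn as nn

# Sketch of the conditional discriminator input (illustrative assumption):
# the background/context image is concatenated channel-wise with the
# generated face, so the discriminator's first convolution sees 6 channels.
generated = torch.randn(1, 3, 128, 128)    # generator output (RGB)
background = torch.randn(1, 3, 128, 128)   # background with face region cut out (RGB)

disc_input = torch.cat([generated, background], dim=1)   # (1, 6, 128, 128)
first_conv = nn.Conv2d(in_channels=6, out_channels=64, kernel_size=3, padding=1)
features = first_conv(disc_input)
print(features.shape)   # torch.Size([1, 64, 128, 128])
```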

Pros:

  • models readily available
  • works on general datasets

Cons:

  • Difficult training
  • No control over anonymization

AttGAN: Facial Attribute Editing by Only Changing What You Want

  • Why not use Fader Networks? Fader Networks introduce an adversarial process that forces the latent representation to be invariant to the attributes. However, the attributes describe the characteristics of a face image, so the attributes and the face latent representation are closely dependent and related in a complex way. Simply imposing an attribute-independence constraint on the latent representation therefore not only restricts its representation ability but may also cause information loss, which is harmful to attribute editing.

Testing Formulation

\(x^a\) is the face image with \(n\) binary attributes \(a = [a_1, ..., a_n]\).

\[z = G_{enc} (x^a)\]

\(b = [b_1, ..., b_n]\) is another set of attributes to be achieved:

\[x^{\hat{b}} = G_{dec}(z, b)\]
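
A minimal PyTorch sketch of this encode/decode interface (the `Genc`/`Gdec` modules, layer sizes, and shapes below are hypothetical, not the reference implementation): the target attribute vector b is broadcast over the spatial dimensions of z and concatenated before decoding.

```python
import torch
import torch.nn as nn

class Genc(nn.Module):
    """Hypothetical one-layer encoder: image -> latent z."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

class Gdec(nn.Module):
    """Hypothetical one-layer decoder: (z, attributes b) -> image."""
    def __init__(self, n_attrs):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(64 + n_attrs, 3,
                                         kernel_size=4, stride=2, padding=1)

    def forward(self, z, b):
        # Broadcast b over the spatial dims of z, then concatenate.
        b_map = b[:, :, None, None].expand(-1, -1, z.size(2), z.size(3))
        return torch.tanh(self.deconv(torch.cat([z, b_map], dim=1)))

n = 13
x_a = torch.randn(4, 3, 128, 128)         # images with attributes a
b = torch.randint(0, 2, (4, n)).float()   # target attributes
z = Genc()(x_a)
x_b = Gdec(n)(z, b)                       # edited images x^b
```

With this interface, editing is `Gdec(Genc(x_a), b)` and the reconstruction used during training is simply `Gdec(Genc(x_a), a)`.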

Training Formulation

An attribute classifier is used to constrain the generated image \(x^{\hat{b}}\) to correctly carry the desired attributes. Meanwhile, adversarial learning is employed on \(x^{\hat{b}}\) to ensure it is visually realistic.

On the other hand, eligible attribute editing should change only the desired attributes while keeping the other details unchanged. To this end, reconstruction learning is introduced to 1) make the latent representation z preserve enough information to later recover the attribute-excluding details, and 2) enable the decoder \(G_{dec}\) to restore the attribute-excluding details from z:

\[x^{\hat{a}} = G_{dec} (z, a)\]
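
A sketch of the three generator-side training terms, reusing an interface like the one above. The attribute classifier C, the discriminator D, and the loss weights are hypothetical, and the non-saturating adversarial term is a common stand-in rather than necessarily the paper's exact choice:

```python
import torch
import torch.nn.functional as F

# Sketch of the three AttGAN-style generator losses (modules and weights
# are assumptions, not the reference implementation).
def generator_loss(x_a, a, b, Genc, Gdec, C, D,
                   lambda_cls=10.0, lambda_rec=100.0):
    z = Genc(x_a)
    x_b = Gdec(z, b)       # editing branch: target attributes b
    x_a_rec = Gdec(z, a)   # reconstruction branch: original attributes a

    # 1) attribute classification: x_b should carry the attributes b
    loss_cls = F.binary_cross_entropy_with_logits(C(x_b), b)
    # 2) adversarial: x_b should look real to the discriminator
    logits = D(x_b)
    loss_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    # 3) reconstruction: decoding z with a should recover x_a,
    #    preserving the attribute-excluding details
    loss_rec = F.l1_loss(x_a_rec, x_a)

    return loss_adv + lambda_cls * loss_cls + lambda_rec * loss_rec
```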

Extension for Attribute Style Manipulation

Converting binary attributes to continuous ones

Style controllers: \(\theta = [\theta_1, \theta_2, ..., \theta_n]\). Each \(\theta_i\) is bound to the \(i\)th attribute, and the mutual information between the controllers and the output images is maximized to make them highly correlated. A style predictor Q is added, and the attribute editing is reformulated as \(x^{\hat{\theta}\hat{b}} = G_{dec}(G_{enc}(x^a), \theta, b)\) (see the sketch below).
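
An InfoGAN-style sketch of the mutual-information term. The style predictor Q, the reformulated decoder signature `Gdec(z, theta, b)`, and the MSE surrogate (a Gaussian variational bound with fixed variance) are assumptions: Q must recover the sampled controllers from the edited image, which ties each \(\theta_i\) to the \(i\)th attribute.

```python
import torch
import torch.nn.functional as F

# InfoGAN-style mutual-information surrogate (modules are hypothetical).
def style_info_loss(x_a, b, Genc, Gdec, Q, n_attrs):
    theta = torch.rand(x_a.size(0), n_attrs)   # sampled style controllers
    x_edit = Gdec(Genc(x_a), theta, b)         # edited image x^{theta, b}
    theta_pred = Q(x_edit)                     # Q tries to recover theta
    # Minimizing this MSE maximizes a lower bound on I(theta; x_edit)
    return F.mse_loss(theta_pred, theta)
```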

Experiments

Dataset: CelebA

13 Attributes: Bald, Bangs, Black Hair, Blond Hair, Brown Hair, Bushy Eyebrows, Eyeglasses, Gender, Mouth Open, Mustache, No Beard, Pale Skin, Age