I recently obtained the **best generative results** ever achieved with deep networks without the use of adversarial training.

Below is a sample of the kind of faces I was able to generate using CelebA as the training set, with a **moderate use of computing resources**:

The result is based on two insights, which I briefly discuss in the following sections:

- a new balancing policy between reconstruction error and Kullback-Leibler divergence in the VAE loss function

- a renormalization operation for two-stage VAEs, compensating for the loss of variance typical of (Variational) autoencoders.

### Balancing reconstruction error and KL-divergence

The loss function of Variational Autoencoders has the following shape (written here as a quantity to minimize):

$$\mathcal{L}(X) = -\mathbb{E}_{z \sim Q(z|X)}\left[\log P(X|z)\right] + KL\left(Q(z|X)\,\|\,P(z)\right)$$

The first component is meant to enhance the quality of reconstructed images, while the second component acts as a regularizer of the latent space, pushing it toward the prior distribution $P(z)$. The two components have contrasting effects, and their balance is a delicate issue.

Our solution consists in keeping a **constant balance** throughout training. Supposing $P(X|z)$ has a Gaussian shape, the negative log-likelihood reduces, up to constants, to the mean squared error between the reconstructed image $\hat{X}$ and the original sample $X$. During training, we normalize this component by an estimate of the current error, computed over minibatches. In this way, the KL component cannot easily prevail, which would otherwise prevent any further improvement in the quality of reconstructions.
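As a rough illustration of the balancing idea, here is a minimal NumPy sketch. It assumes a Gaussian decoder whose variance is set to a running estimate of the minibatch mean squared error; the class name `BalancedLoss`, the `momentum` parameter, and the exact update rule are illustrative assumptions, not necessarily the procedure used in the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), averaged over the batch."""
    kl_per_sample = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return float(np.mean(kl_per_sample))

class BalancedLoss:
    """Hypothetical helper: rescales the reconstruction term by a running
    estimate of the current mean squared error (an assumed update rule)."""

    def __init__(self, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.mse_estimate = None  # running estimate of the decoder variance

    def __call__(self, x, x_hat, mu, log_var):
        n_dims = x[0].size  # number of data dimensions per sample
        mse = float(np.mean((x - x_hat) ** 2))
        # Update the running estimate from the current minibatch.
        if self.mse_estimate is None:
            self.mse_estimate = mse
        else:
            self.mse_estimate = (self.momentum * self.mse_estimate
                                 + (1.0 - self.momentum) * mse)
        # Gaussian negative log-likelihood (up to constants) with variance
        # equal to the running estimate: sum of squared errors / (2 * variance).
        # In an autodiff framework the estimate would be treated as a constant.
        recon = 0.5 * n_dims * mse / (self.mse_estimate + self.eps)
        return recon + kl_to_standard_normal(mu, log_var)
```

Because the reconstruction term is divided by an estimate of its own current magnitude, it stays on a roughly constant scale as the reconstructions improve, so the KL term does not progressively take over.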

More details can be found in this article:

Andrea Asperti, Matteo Trentin,
*Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders*, arXiv preprint arXiv:2002.07514 (2020).

### Variance Loss and renormalization

Data generated by variational autoencoders seem to suffer from a systematic **loss of variance** with respect to the training set. The phenomenon is likely due to averaging, which in turn could be caused by dimensionality reduction (as in the case of PCA), or by sampling in the latent space, typical of the variational approach.
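The PCA case is easy to check numerically. The sketch below uses synthetic Gaussian data (an assumption made purely for illustration) and compares the total variance of the data with that of its reconstruction from the first few principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10 features with decreasing standard deviation.
data = rng.normal(size=(2000, 10)) * np.linspace(2.0, 0.5, 10)
centered = data - data.mean(axis=0)

# Principal directions from the SVD of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

k = 3                                   # number of components kept
components = vt[:k]                     # shape (k, 10)
reconstruction = centered @ components.T @ components

total_var_data = centered.var(axis=0).sum()
total_var_recon = reconstruction.var(axis=0).sum()

print(f"variance of data:           {total_var_data:.3f}")
print(f"variance of reconstruction: {total_var_recon:.3f}")
```

The reconstruction is an orthogonal projection of the data, so its total variance can only be lower: the variance along the discarded directions is simply lost.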

This is relevant, since generative models are evaluated with metrics such as the Fréchet Inception Distance (FID), which compare precisely the distributions of (features of) real versus generated images.
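To see why a loss of variance alone is penalized, recall that FID fits a Gaussian to each set of Inception features and measures the Fréchet distance between the two. The sketch below is a simplified version that assumes diagonal covariances (real FID uses full covariance matrices), in which case the trace term reduces to a sum of squared differences between standard deviations. The function name `frechet_distance_diag` is hypothetical.

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances.

    For diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^(1/2)) equals
    sum((sqrt(var1) - sqrt(var2))^2).
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return float(mean_term + cov_term)

# Same means in both cases; only the variance of the "generated" features shrinks.
mu = np.zeros(8)
same = frechet_distance_diag(mu, np.ones(8), mu, np.ones(8))
shrunk = frechet_distance_diag(mu, np.ones(8), mu, np.full(8, 0.64))
print(same, shrunk)
```

Even with perfectly matching means, the distance becomes strictly positive as soon as the generated features have a smaller variance than the real ones.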

The variance loss becomes particularly dangerous in a two-stage setting, where a second VAE is used to sample in the latent space of the first VAE. The reduced variance creates a mismatch between the actual distribution of latent variables and the distribution of those generated by the second VAE, which hinders the beneficial effects of the second stage.

A simple solution is to renormalize the output of the second VAE towards the expected spherical normal distribution (or, better, towards the moments of this distribution as computed, for example, by the variance law discussed in one of my previous posts).
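A minimal sketch of such a renormalization, assuming the statistics are computed per latent dimension over a batch of generated latent vectors. The function name `renormalize` and its parameters are illustrative; `target_mean` and `target_std` default to the spherical normal case but can be set to other moments.

```python
import numpy as np

def renormalize(z, target_mean=0.0, target_std=1.0, eps=1e-8):
    """Rescale a batch of latent vectors, shape (batch, latent_dim), so that
    each latent dimension has the target mean and standard deviation."""
    mean = z.mean(axis=0, keepdims=True)
    std = z.std(axis=0, keepdims=True)
    return (z - mean) / (std + eps) * target_std + target_mean

# Example: latents whose variance has shrunk and whose mean has drifted slightly.
rng = np.random.default_rng(0)
z_second_stage = 0.7 * rng.normal(size=(1000, 16)) + 0.1
z_fixed = renormalize(z_second_stage)
print(z_second_stage.std(axis=0).mean(), z_fixed.std(axis=0).mean())
```

The renormalized latents would then be passed to the decoder of the first VAE in place of the raw output of the second stage.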

This simple operation typically results in a noticeable improvement in the perceptual quality of generated images, and a remarkable improvement in FID.

In the image below, you can see the effect of our technique on images generated from the same seed: on the left is the original image, and on the right the result of the latent space renormalization.

More details can be found in this article:

Andrea Asperti,
*Variance Loss in Variational Autoencoders*, arXiv preprint arXiv:2002.09860 (2020).