Variational Autoencoders (VAEs), in comparison with alternative generative techniques, usually produce images with a characteristic and annoying blurriness.

A possible explanation can be given in terms of a related problem, which really seems to be the opposite facet of the same phenomenon: the variance of generated data is significantly lower than that of the training data.

This problem of VAEs is not particularly well known, which is precisely why I decided to write a post on the subject.

First of all, what variance are we interested in?

Typically, the output of the network is a tensor with shape (N, D), where N is the batch size and D is the dimension of the output. To fix ideas, let us suppose the output is a coloured image of width W and height H, so that D = W × H × 3. We want to measure the variance of each single pixel along the batch axis, that is, how the pixel varies over the data under consideration. Then, we compute the mean of all the variances obtained in this way.

Working with numpy, the quantity we are interested in is `np.mean(np.var(X, axis=0))`, where `X` is the batch of images.

Now, we compute this expression on a large number of real data (e.g. on the validation set) and on an equal number of reconstructed (or generated) images, and take the difference between the two: `np.mean(np.var(X_real, axis=0)) - np.mean(np.var(X_gen, axis=0))`.
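As a minimal sketch of the computation above (using synthetic arrays in place of an actual VAE's outputs — the variable names are placeholders, not part of any real model):

```python
import numpy as np

# Synthetic stand-ins for batches of 32x32 RGB images.
rng = np.random.default_rng(0)
X_real = rng.uniform(0.0, 1.0, size=(1000, 32, 32, 3))
# "Generated" data artificially squeezed toward the mean,
# mimicking the variance loss discussed in the text.
X_gen = 0.5 * X_real + 0.25

def mean_pixel_variance(X):
    # Variance of each pixel along the batch axis, averaged over pixels.
    return np.mean(np.var(X, axis=0))

gap = mean_pixel_variance(X_real) - mean_pixel_variance(X_gen)
print(gap)  # positive: the "generated" data has lower variance
```

With real VAE outputs the gap is of course not built in by construction, but the measurement is exactly this one-liner.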

You can take your favourite VAE and perform the above simple computation. We would expect the two values to be approximately equal. In particular, there would be no reason to expect the variance of one of the two sets to be higher or lower than the other: the difference could be positive as well as negative.

**But this is not the case.** As you will discover, the difference is positive and far from negligible, which means precisely that we are witnessing a **loss of variance in generated data.**

## The magnitude of the variance loss

Can we estimate the amount of this loss? Experimentally, it turns out to be approximately equal to the mean squared error between real and reconstructed images:

`np.mean(np.var(X_real, axis=0)) - np.mean(np.var(X_gen, axis=0)) ≈ np.mean(np.square(X_real - X_rec))`

Note that, contrary to the previous expression, the mse is positive by definition.
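One way to see why such a relation might hold is the law of total variance: if the reconstruction were exactly the conditional mean of the data given the latent code, then Var(X) = E[Var(X|z)] + Var(E[X|z]), and the variance gap would equal the mse exactly. The following toy model (entirely synthetic — `tanh` decoder and noise level are arbitrary choices, not taken from any real VAE) illustrates this:

```python
import numpy as np

# Toy model: data = deterministic function of a latent + pixel noise,
# and the "reconstruction" is the conditional mean of the data given z.
rng = np.random.default_rng(1)
z = rng.normal(size=(100_000, 1))
X_rec = np.tanh(z @ rng.normal(size=(1, 16)))   # plays the role of E[X|z]
X = X_rec + 0.1 * rng.normal(size=X_rec.shape)  # observed data

var_gap = np.mean(np.var(X, axis=0)) - np.mean(np.var(X_rec, axis=0))
mse = np.mean(np.square(X - X_rec))
print(var_gap, mse)  # both close to 0.01, the pixel-noise variance
```

A real VAE decoder is of course not exactly the conditional mean of the data, which is why the relation observed in practice is approximate rather than an identity.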

As a working example, consider the simple variational autoencoder for MNIST defined by F. Chollet in his post Building Autoencoders in Keras.

In this case (after 50 epochs of training), we have

A lucky coincidence, you might say. Well, quoting master Kenobi, let me say that, in my experience, there is no such thing as luck.

Since I first conjectured the above relation, I have systematically measured it on all the VAEs I had the opportunity to work with. Over a couple of years I collected a fairly large amount of data, summarized in the Figure below.

The different colors refer to different neural architectures: blue = Dense Networks; red = ResNet-like; green = Convolutional Networks; yellow = Iterative Networks (DRAW, GQN-like). The point is not to emphasize the differences between the various architectures (that would require a more extensive evaluation), but to remark that the relation between the variance gap and the mse is **systematic**, and largely **independent of the specific architecture**.

Of course, if the variance loss is of the same magnitude as the mean squared error, reducing the reconstruction error may also implicitly mitigate the variance issue. However, the log-likelihood objective, in conjunction with the averaging induced by dimensionality reduction and the variational approach, really seems to be the cause of the loss, so merely trying to improve reconstruction may not be the right approach.

Intuitively, VAEs are, by their nature, extremely prudent in their predictions (at the opposite extreme of GANs). Even when generated samples are really good, as in the case of the recent NVAE, they still have a glossy, patinated look and an extremely conventional appearance.

What we would like is to find a way to coax a bit more “verve” out of them, increasing the “temperature” of generated samples.

If you are interested, you may find additional details on this topic in this article:

Andrea Asperti, Variance Loss in Variational Autoencoders. Sixth International Conference on Machine Learning, Optimization, and Data Science, July 19–23, 2020, Certosa di Pontignano, Siena, Italy.