A stationary condition for Variational Autoencoders

14/2/2019

In this article, we continue our investigation of Variational Autoencoders (see our previous posts on the regularization effect of the Kullback-Leibler divergence, and the sparsity phenomenon). In particular, we shall point out an interesting stationary condition induced by the Kullback-Leibler component of the objective function.


Let us first of all observe that trying to compute relevant statistics for the posterior distribution Q(z|X) of latent variables without some kind of regularization constraint does not make much sense. Indeed, given a network computing mean \mu_z(X) and variance \sigma_z^2(X) for a given latent variable z, we can easily build another one with precisely the same behavior by scaling mean and standard deviation by some constant \gamma (uniformly, for all data), and then downscaling the generated samples in the next layer of the network. Linear transformations of this kind are easily performed by any neural network (for the same reason, it does not make much sense to add a batch-normalization layer just before a linear layer).

Let’s see how the KL-divergence helps to choose a solution. In the following, we focus on a single latent variable z, omitting it from the notation. Starting from the assumption that it is easy for the network to keep a fixed ratio \rho^2(X) = \sigma^2(X)/\mu^2(X), we can substitute this value into the closed form of the Kullback-Leibler divergence, that is

{\mathit{KL}(G(\mu(X),\sigma^2(X))||G(0,1))= (\mu(X)^2+\sigma^2(X) -\log(\sigma^2(X)) -1) /2}

getting the following expression:

(\sigma^2(X)\frac{1 + \rho^2(X)}{\rho^2(X)}-\log(\sigma^2(X))-1)/2  \hspace{1cm}(1)

In Figure 1, we plot the previous function in terms of the variance, for a few given values of \rho.

Figure 1: KL-divergence for different values of \rho; observe the strong minimum for small values of \rho.
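As a quick check, the curves in Figure 1 can be reproduced with a few lines of Python; the function below is just expression (1), and the grid of variances and the sample values of \rho are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def kl_given_rho(sigma2, rho2):
    # Expression (1): the KL term rewritten in terms of sigma^2(X) and rho^2(X)
    return 0.5 * (sigma2 * (1 + rho2) / rho2 - np.log(sigma2) - 1)

sigma2 = np.linspace(0.01, 2.0, 500)
for rho in [0.1, 0.5, 1.0, 2.0]:            # arbitrary sample values of rho
    plt.plot(sigma2, kl_given_rho(sigma2, rho ** 2), label=f"rho = {rho}")
plt.xlabel("variance")
plt.ylabel("KL divergence")
plt.legend()
plt.show()
```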

The above function has a minimum for

{\sigma^2(X) = \frac{\rho^2(X)}{1+\rho^2(X)}}\hspace{1cm}(2)

close to 0 when \rho is small, and close to 1 when \rho is high. Of course \rho depends on X, while the rescaling operation after sampling must be uniform; still, the network will have a propensity to synthesize variances \sigma^2(X) close to \frac{\rho^2(X)}{1+\rho^2(X)}, a value between 0 and 1 (below, we shall average over all X).
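For completeness, the position of the minimum in (2) is obtained by differentiating expression (1) with respect to \sigma^2(X) and equating the derivative to zero:

\frac{\partial}{\partial \sigma^2(X)}\left(\sigma^2(X)\frac{1 + \rho^2(X)}{\rho^2(X)}-\log(\sigma^2(X))-1\right)/2 = \left(\frac{1 + \rho^2(X)}{\rho^2(X)}-\frac{1}{\sigma^2(X)}\right)/2 = 0

which is satisfied precisely when \sigma^2(X) = \frac{\rho^2(X)}{1+\rho^2(X)}.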

Substituting the definition of \rho^2(X) into equation (2), we expect to reach a minimum when \sigma^2(X) = \frac{\sigma^2(X)}{\mu^2(X)}\frac{\mu^2(X)}{\mu^2(X)+\sigma^2(X)} = \frac{\sigma^2(X)}{\mu^2(X)+\sigma^2(X)}, which, by trivial computations, implies the following simple stationary condition:

\sigma^2(X) + \mu^2(X) = 1

Let us now average the KL components over all data X:

\frac{1}{N}\sum_X \frac{1}{2}(\mu(X)^2 + \sigma^2(X)-\log(\sigma^2(X)) -1)

We use the notation \widehat{f(X)} to abbreviate the average \frac{1}{N}\sum_X f(X) of f(X) over all data X.
The ratio \rho^2 = \frac{\widehat{\sigma^2(X)}}{\widehat{\mu^2(X)}} can indeed (and easily) be kept constant by the network. Let us also observe that, assuming the mean of the latent variable to be 0, \widehat{\mu^2(X)} is just the (global) variance \sigma^2 of the latent variable.

Substituting \rho^2 into the previous expression, we get

{\frac{1}{2}(\widehat{\sigma^2(X)}\frac{1 + \rho^2}{\rho^2}-\widehat{\log(\sigma^2(X))} -1)}

Now we perform a somewhat rough approximation: the average of the logarithms \widehat{\log(\sigma^2(X))} is the logarithm of the geometric mean of the variances. If we replace the geometric mean with the arithmetic mean, we get an expression essentially equivalent to expression (1), namely

{\frac{1}{2}(\widehat{\sigma^2(X)}\frac{1 + \rho^2}{\rho^2}-\log(\widehat{\sigma^2(X)}) -1)}

that has a minimum when

\widehat{\sigma^2(X)} = \frac{\rho^2}{1+\rho^2}

that implies

\widehat{\sigma^2(X)} + \widehat{\mu^2(X)} = 1

or simply,

\widehat{\sigma^2(X)} + \sigma^2 = 1\hspace{1cm}(3)

where we replaced \widehat{\mu^2(X)} with the variance \sigma^2 of the latent variable in view of the consideration above.

Condition (3) can be experimentally verified. In spite of the rough approximation we made to derive it, it proves to be quite accurate, provided the Variational Autoencoder is sufficiently trained. You can check it in your own experiments, or compare it with the data provided in our previous posts.
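As an illustration, here is a minimal numpy sketch of the check; it assumes you have already collected the arrays mu and sigma2 (one row per input X, one column per latent variable) from the encoder of your trained model.

```python
import numpy as np

def check_stationary_condition(mu, sigma2):
    """Condition (3), checked per latent variable.

    mu, sigma2: arrays of shape (num_data, num_latent), collected from the
    encoder over the whole dataset (assumed already computed elsewhere).
    """
    avg_sigma2 = sigma2.mean(axis=0)     # \widehat{\sigma^2(X)}
    avg_mu2 = (mu ** 2).mean(axis=0)     # \widehat{\mu^2(X)}: the global variance,
                                         # assuming the latent means are centered at 0
    return avg_sigma2 + avg_mu2          # expected to be close to 1 for each variable

# Typical usage (mu and sigma2 computed by your encoder):
# print(check_stationary_condition(mu, sigma2))
```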

Let us finally remark that condition (3) is supposed to hold for both active and inactive variables.

Sparsity in Variational Autoencoders

In our previous article, we discussed the regularization effect of the Kullback-Leibler divergence in the objective function of Variational Autoencoders, providing empirical evidence that it results in a better coverage of the latent space.

In this article, we shall discuss another important effect of it: when working in latent spaces of sufficiently high dimension, the latent representation becomes sparse. Many latent variables are zeroed out (independently of the input), the variance computed by the network for them is around 1 (while their actual variance is close to 0), and in any case those variables are neglected by the decoder.

This property is usually known under the name of over-pruning, since it induces the model to use only a small number of its stochastic units. In fact, this is a form of sparsity, with all the benefits typically associated with this kind of regularization.

* * *

Sparsity is a well known and desirable property of encodings: it forces the model to focus on the relevant features of data, usually resulting in more robust representations, less prone to overfitting. 

Sparsity is typically achieved in neural networks by means of L1 regularizers, directly acting on the weights. Remarkably, the same behaviour is induced in Variational Autoencoders by the Kullback-Leibler divergence, simply acting on the variance of the encoding distribution Q(z|X).

The most interesting consequence is that, at least for a given architecture, there seems to exist an intrinsic internal dimension of the data. This property can be exploited both to understand if the network has sufficient internal capacity, augmenting it to attain sparsity, and, conversely, to reduce the dimension of the network by removing links to unused neurons. Sparsity also helps to explain the loss of variability in random generative sampling from the latent space that one may sometimes observe with Variational Autoencoders.

In the following sections we shall investigate sparsity for a couple of typical test cases.

MNIST

We start with the well known MNIST dataset of handwritten digits that we already used in our previous article. Our first architecture is a dense network with dimensions 784-256-64-32-16.
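For reference, here is a minimal PyTorch sketch of such a model; the original post only specifies the layer dimensions, so the activation functions (ReLU, with a sigmoid output) and the choice of predicting the log-variance are our own assumptions.

```python
import torch
import torch.nn as nn

class DenseVAE(nn.Module):
    """Dense VAE with encoder dimensions 784-256-64-32 and a 16-dim latent space."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.mu = nn.Linear(32, 16)        # mean of Q(z|X)
        self.logvar = nn.Linear(32, 16)    # log of the variance sigma^2(X)
        self.decoder = nn.Sequential(
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sampling at training time
        return self.decoder(z), mu, logvar
```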

In Figure 1 we show the evolution of the variance of the 16 latent variables during a typical training run.

Figure 1: evolution of the variance during training (16 variables, MNIST case). On the x-axis we have the number of minibatches, each of size 128. The variance of some variables goes to zero (as expected), while for other variables it goes to 1 instead; the latter variables will be neglected by the decoder.

Table 1 provides relevant statistics for each latent variable at the end of training, computed over the full dataset: its variance (which we expect to be around 1, since the variable should be normally distributed), and the mean of the variance σ²(X) computed by the network (which we expect to be a small value, close to 0). The mean value is around 0 as expected, and we do not report it.

 no   variance        mean(σ²(X))
  0*  8.847272e-05    0.9802212
  1*  0.00011756      0.99551463
  2*  6.665453e-05    0.98517334
  3   0.97417927      0.008741336
  4   0.99131817      0.006186147
  5   1.0012343       0.010142518
  6   0.94563377      0.057169348
  7*  0.00015841      0.98205334
  8   0.94694275      0.033207607
  9*  0.00014789      0.98505586
 10   1.0040375       0.018151345
 11   0.98543876      0.023995731
 12*  0.000107441     0.9829797
 13*  4.5068125e-05   0.998983
 14*  0.0001085       0.9604088
 15   0.9886378       0.044405878
Table 1: statistics of the latent variables in the 784-256-64-32-16 VAE for MNIST digits; inactive variables are marked with an asterisk.

All variables marked with an asterisk have an anomalous behavior: their variance is very low (in practice, they always take value 0), while the variance σ²(X) computed by the network is around 1 for every X. In other words, the representation is getting sparse! Only 8 latent variables out of 16 are in use: the other ones are completely ignored by the generator.
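The statistics in Table 1 are easy to recompute. A possible sketch, again assuming the arrays mu and sigma2 have been collected over the whole dataset, and with an arbitrary threshold to flag inactive variables:

```python
import numpy as np

def latent_statistics(mu, sigma2, threshold=0.1):
    """Per-variable statistics as in Table 1.

    mu, sigma2: arrays of shape (num_data, latent_dim) from the encoder.
    A variable is flagged as inactive when its empirical variance over the
    dataset falls below `threshold` (an arbitrary cutoff, for illustration).
    """
    variance = mu.var(axis=0)            # empirical variance of each latent variable
    mean_sigma2 = sigma2.mean(axis=0)    # average of the variance computed by the network
    inactive = np.where(variance < threshold)[0]
    return variance, mean_sigma2, inactive
```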

As an additional confirmation of this fact, in Figure 2 we show a few digits randomly generated from Gaussian sampling in the latent space (upper line) and the result of generation when inactive latent variables have been zeroed-out (lower line): they are indistinguishable.

Figure 2: Upper line: digits generated from a vector of 16 normally sampled latent variables. Lower line: digits generated after the inactive variables have been zeroed out; these latent variables are completely neglected by the generator.
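The comparison in Figure 2 only requires zeroing the inactive coordinates before decoding. A hypothetical sketch, where decoder stands for the trained decoder network and the list of indices comes from Table 1:

```python
import torch

inactive = [0, 1, 2, 7, 9, 12, 13, 14]     # inactive variables of Table 1

z = torch.randn(10, 16)                    # Gaussian sampling in the 16-dim latent space
digits_full = decoder(z)                   # upper line of Figure 2

z_masked = z.clone()
z_masked[:, inactive] = 0.0                # zero-out the inactive variables
digits_masked = decoder(z_masked)          # lower line: indistinguishable from the first
```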

Let’s now try with a convolutional VAE. We consider a relatively sophisticated network, with the structure shown in Figure 3.

Figure 3: architecture of the convolutional VAE

In this case, sparsity is less evident, but still present: 3 variables out of 16 are inactive.

 no   variance        mean(σ²(X))
  0*  0.00015939      0.99321973
  1   0.91644579      0.15826659
  2   1.05478882      0.0062623904
  3   0.98102569      0.012602937
  4   1.07353293      0.0051363203
  5   1.06932497      0.0066873398
  6*  6.477744e-05    0.96163213253
  7   1.11955034      0.0031915947
  8   0.88755643      0.024708110
  9   0.97943300      0.0094883628
 10   0.9322856       0.016983853
 11   1.40059826      0.0025208105
 12   1.30227565      0.0033756110
 13*  0.00019337      0.99533605
 14   1.13597583      0.0076088942
 15   1.33482563      0.002084088
Table 2: statistics of the latent variables in the convolutional VAE for MNIST digits; inactive variables are marked with an asterisk.

The reduced sparsity seems to suggest that convolutional networks exploit latent variables better, typically resulting in more precise reconstructions and improved generative sampling.
This is likely due to the fact that latent variables encode information corresponding to different portions of the input space, and are hence less likely to become useless for the generator.

The sparsity phenomenon was first highlighted by Burda et al. in this work, and later confirmed by many authors on several different datasets and neural architectures. We discuss this debated topic and survey the recent literature in this article, where we also investigate a few more concrete examples.

The degree of sparsity may slightly vary from training to training, but not in a significant way (at least, for similar final values of the loss function). This seems to suggest that, given a certain neural architecture, there exists an intrinsic, “optimal” compression of data in the latent space. If the network does not exhibit sparsity, it is probably a good idea to augment the dimension of the latent space; conversely, if the network is sparse, we may reduce its dimension by removing the inactive latent variables and their connections.

Kullback-Leibler divergence and sparsity

Let us consider the (lower bound of the) log-likelihood for a data point X:

E_{z∼Q(z|X)} log P(X|z) − KL(Q(z|X)||P(z))

If we remove the Kullback-Leibler component from the previous objective function, or just keep the quadratic penalty on the latent variables, the sparsity phenomenon disappears. So, sparsity must be related to that part of the loss function.

It is also evident that if the generator ignores a latent variable, P(X|z) will not depend on it, and the objective is maximal when the distribution Q(z|X) is equal to the prior distribution P(z), that is, a normal distribution with mean 0 and standard deviation 1. In other words, the encoder is induced to learn a trivial encoding z_X = 0 and a (fake) variance σ²(X) = 1. Sampling has no effect, since the sampled value for z_X will just be ignored.
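In formulas: since the reconstruction term is unaffected by Q(z|X) in this case, the objective is maximized by simply annihilating the Kullback-Leibler term,

KL(Q(z|X)||P(z)) = 0   ⟺   Q(z|X) = P(z) = G(0,1),   that is   z_X = 0 and σ²(X) = 1 for every X.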

Intuitively, if during training a latent variable is of limited use for reconstructing the input (compared with the other variables), the network will learn to give it less importance; at some point, the Kullback-Leibler divergence may prevail, pushing the mean towards 0 and the standard deviation towards 1. This makes the latent variable even noisier, in a vicious loop that will eventually induce the network to ignore it altogether.

We can get some empirical evidence of the previous phenomenon by artificially deteriorating the quality of a specific latent variable.
In Figure 4, we show the evolution during training of one of the active variables of the variational autoencoder of Table 1, subject to a progressive addition of Gaussian noise. During the experiment, we force the variables that were already inactive to remain so; otherwise, the network would compensate for the deterioration of a new variable by revitalizing one of the dead ones.

Figure 4: Evolution of the reconstruction gain and of the KL-divergence of a latent variable during training, while its quality is progressively degraded by addition of Gaussian noise. We also show in the same picture the evolution of the variance, to compare their progress.

In order to evaluate the contribution of the variable to the loss function, we compute the difference between the reconstruction error when the latent variable is zeroed out and the error when it is normally taken into account; we call this quantity the reconstruction gain.
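In code, the reconstruction gain of a latent variable i could be estimated along the following lines; this is only a sketch, assuming a model structured like the dense VAE above and some reconstruction error function recon_error (e.g. a mean squared error).

```python
import torch

def reconstruction_gain(model, x, i, recon_error):
    """Reconstruction error with variable i zeroed out, minus the error when
    the variable is used normally: a positive gain means the variable helps."""
    with torch.no_grad():
        h = model.encoder(x)
        mu = model.mu(h)                       # use the means, no sampling here
        err_normal = recon_error(model.decoder(mu), x)
        mu_zeroed = mu.clone()
        mu_zeroed[:, i] = 0.0                  # zero-out the variable under study
        err_zeroed = recon_error(model.decoder(mu_zeroed), x)
    return err_zeroed - err_normal
```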

After each increment of the Gaussian noise we repeat one epoch of training, to allow the network to suitably reconfigure itself. In this particular case, the network reacts to the noise by enlarging the mean values \mu_z(X) of the posterior distribution Q(z|X), in an attempt to escape from the noisy region, but jointly increasing the KL-divergence. At some point, the reconstruction gain of the variable drops below the KL-divergence, and we stop incrementing the noise. Here we witness the sparsity phenomenon: the KL term suddenly pushes the variance towards 1, which decreases the KL-divergence, but also causes a sudden and catastrophic collapse of the reconstruction gain of the latent variable.

Contrary to what is frequently believed, sparsity seems to be reversible, to some extent. If we remove the noise from the variable, as soon as the network is able to perceive some potential in it (which may take several epochs, as is evident in Figure 4), it will eventually make suitable use of it. Of course, we should not expect to recover the original reconstruction gain, since the network may have meanwhile learned a different repartition of roles among the latent variables.

About the Kullback-Leibler divergence in Variational Autoencoders

22/12/2018

In this article we shall try to provide an intuitive explanation of the Kullback-Leibler component in the objective function of Variational Autoencoders (VAEs). Some preliminary knowledge of VAEs could be required: see e.g. Doersch’s excellent tutorial for an introduction to the topic.

* * *

Variational Autoencoders (VAEs) are a fascinating variant of autoencoders supporting the random generation of new data samples. The log-likelihood log(P(X)) of a data sample X is approximated by a term known as the evidence lower bound (ELBO), defined as follows:
 
E_{z∼Q(z|X)} log P(X|z) − KL(Q(z|X)||P(z))          (1)
 
where E denotes an expected value and KL(Q||P) is the Kullback-Leibler divergence of Q from P.

You should think of Q(z|X) as an encoder mapping data X to a vector of latent random variables z, and of P(X|z) as a decoder, reconstructing the input given its encoding. P(z) is a prior distribution over the latent variables: typically, a normal distribution.

The first term in equation (1) is simply a distance of the reconstruction from the original: if P(X|z) has a Gaussian distribution around some decoder function d(z), its logarithm is a quadratic loss between X and d(z_X), where z_X is the encoding of X. The interesting point is that, at training time, instead of using precisely z_X to reconstruct the input image (as we would do in a traditional autoencoder), we sample around this point according to the (learned) distribution Q(z|X). Supposing Q(z|X) has a normal shape, learning the distribution means learning its moments: the mean value z_X and the variance σ²(X).
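This sampling step is usually implemented with the so-called reparameterization trick, which keeps the operation differentiable; a minimal sketch, assuming the encoder outputs the mean and the logarithm of the variance:

```python
import torch

def sample_latent(mu, logvar):
    """Draw z ~ Q(z|X) = G(mu, sigma^2) as z = mu + sigma * eps, with eps ~ G(0,1)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```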

Intuitively, you can imagine the region around z_X of size σ²(X) as the portion of the latent space able to produce a reconstruction close to the original X: for this reason, we expect σ²(X) to be a small value, since we do not want the encodings of different data points to overlap each other.

In this video, we describe the trajectories followed in a bidimensional latent space by ten random digits of the MNIST dataset (one for each class) during the first epoch of training. The animation is summarized in Figure 1, where we use a fading effect to describe the evolution in time.

Figure 1: Trajectories followed in a bidimensional latent space by ten MNIST digits during the first training epoch.

For each digit, the area of the circle expresses the variance computed by the VAE (more precisely, r² is the geometric average of the two variances along x and y). Initially it is close to 1, but it rapidly decreases to a very small size; this is not surprising, since we need to find room for 60000 different digits! Also, observe that all digits progressively distribute around the center, in a Gaussian-like cloud. The Gaussian distribution is better appreciated in Figure 2, where we show the position and “size” of 60 digits (6 for each class) after 10 training epochs.

Figure 2: position and variance of 60 MNIST digits after 10 epochs of training. Observe the Gaussian-like distribution and, especially, the really small values of the variance.

 

Let us also remark on the really small values of the variance σ²(X) for all data X (expressed by the area of the circles). Actually, they would further decrease as training proceeds.

The puzzling nature of the Kullback-Leibler term

The questions we shall try to answer are the following:

  1. Why are we interested in learning the variance σ²(X)? Note that this variance is not used during the generation of new samples, since in that case we sample from the prior normal distribution (we do not have any X!). The variance σ²(X) is only used for sampling during training, but even the relevance of this operation (apart, possibly, from improving the robustness of the generator) is not evident, especially since, as we have seen, σ²(X) is typically very small! As a matter of fact, the main operational purpose of sampling during training is precisely to learn the actual value of σ²(X), which takes us back to the original question: why are we interested in learning σ²(X)?
  2. The purpose of the Kullback-Leibler component in the objective function is to bring the probability Q(z|X) close to a normal G(0,1) distribution. That sounds crazy: if, for every X, Q(z|X) were normal, we would have no way to distinguish the different inputs in the latent space. In this case too, we may understand that we try to keep the mean value z_X close to 0 with some quadratic penalty, in order to achieve the expected normal distribution of latent variables (needed for generative sampling), but why are we trying to keep σ²(X) close to 1? If, for a pair of different inputs X’ and X”, the corresponding Gaussians Q(z|X’) = G(z_X’, σ²(X’)) and Q(z|X”) = G(z_X”, σ²(X”)) overlap too much, we would have no practical way to distinguish the two points. The mean values z_X’ and z_X” cannot be too far away from each other, since we expect them to be normally distributed around 0, so we eventually expect the variance σ²(X) to be really small for any X (close to 0), which is what happens in practice. But if this is the expected behavior, why do we include in our learning objective a term keeping σ²(X) close to 1?

The kind of answers we are looking for is not on a theoretical level: the mathematics behind Variational Autoencoders is neat, although not very intuitive. Our purpose is to obtain some empirical evidence that could help us to better grasp the underlying theory.

A closer look at the Kullback-Leibler component

Before addressing the previous questions, let’s have a closer look at the Kullback-Leibler divergence in equation (1). Supposing that Q(z|X) has a Gaussian shape G(z_X, σ²(X)) and the prior P(z) is a normal G(0,1) distribution, we can compute it in closed form:

KL(G(z_X, σ²(X))||G(0,1)) = 1/2 (z_X² + σ²(X) − log(σ²(X)) − 1)       (2)

The term z_X² is a quadratic penalty over the encodings, meant to keep them around 0. The second part, σ²(X) − log(σ²(X)) − 1, pushes σ²(X) towards the value 1. Note that, if we removed this part, sampling during training would lose any interest: if σ²(X) is not counteracted in any way, it tends to go to 0, the distribution Q(z|X) collapses to a Dirac distribution around z_X, and sampling becomes pointless.
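In code, the closed form (2), summed over the latent variables and averaged over a minibatch, reduces to a couple of lines; the sketch below assumes the encoder outputs the mean and the logarithm of σ²(X).

```python
import torch

def kl_term(mu, logvar):
    """Closed form (2): 1/2 (z_X^2 + sigma^2(X) - log(sigma^2(X)) - 1),
    summed over the latent variables and averaged over the batch."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1)
    return kl.sum(dim=1).mean()
```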

So, mathematically, sampling at training time allows us to estimate σ²(X), and we are interested in computing σ²(X) because of its usage in equation (2), where we try to counteract the natural tendency of σ²(X) to collapse to 0 by pushing it towards 1. What remains to be understood is the actual purpose of this operation.

Take up as much space as you deserve

The practical purpose of the previous mechanism is to induce each data point X to occupy as much space in the latent space as it deserves, compatibly with the similar requirements of the other points. How much space can it take? In principle, all the available space: that is precisely the reason why we try to keep the variance σ²(X) close to 1.

In practice, each data point X is forced to compete with all the other points for the occupancy of the latent space, each one pushing the others as far away as possible. This should hopefully result in a better coverage of the latent space, which should in turn produce a better generation of new samples.

Let’s try to get some empirical evidence of this fact.

In the case of MNIST, we start getting some significant empirical evidence when considering a sufficiently deep architecture in a latent space of dimension 3 (with 2 dimensions it is difficult to appreciate the difference). In Figure 3, we show the final distribution of 5000 MNIST digits in a 3-dimensional latent space with and without sampling during training (in the case without sampling we keep the quadratic penalty on z_X). We also show the result of generative sampling from the latent space, organized in five horizontal slices of 25 points each. For this example we used a dense 784-256-64-16-3 architecture.
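For clarity, here is a sketch of the two objectives being compared; the quadratic reconstruction loss follows from the Gaussian assumption on P(X|z), while the relative weighting of the two components is an assumption, since the post does not specify it.

```python
import torch
import torch.nn.functional as F

def loss_with_sampling(x, x_rec, mu, logvar):
    # full VAE objective: reconstruction + closed-form KL divergence
    rec = F.mse_loss(x_rec, x, reduction="sum")
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum()
    return rec + kl

def loss_without_sampling(x, x_rec, mu):
    # deterministic variant: the decoder receives z_X = mu directly,
    # and we only keep the quadratic penalty on the encodings
    rec = F.mse_loss(x_rec, x, reduction="sum")
    return rec + 0.5 * mu.pow(2).sum()
```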

Figure 3: Distribution of 5000 MNIST digits in a 3D latent space,
with sampling at training time (left) and without it (right). 

We may observe that sampling during training induces a much more regular disposition of points in the latent space. In turn, this results in a drastic improvement in the quality of randomly generated images.

Does this scale to higher dimensions? Is the Kullback-Leibler component really enough to induce a good coverage of the latent space in view of generative sampling?

Apparently, yes, to a good extent. But in higher dimensions there is an even more interesting effect produced by the Kullback-Leibler divergence, which we shall discuss in the next article: the latent representation becomes sparse!