Sparsity in Variational Autoencoders

In our previous article, we discussed the regularization effect of the Kullback-Leibler divergence in the objective function of Variational Autoencoders, providing empirical evidence that it results in a better coverage of the latent space.

In this article, we shall discuss another important effect of it: when working in latent spaces of sufficiently high dimension, the latent representation becomes sparse. Many latent variables are zeroed-out (independently of the input), the variance computed by the network for them is around 1 (while their actual variance over the dataset is close to 0), and in any case these variables are neglected by the decoder.

This property is usually known under the name of over-pruning, since it induces the model to use only a small number of its stochastic units. In fact, it is a form of sparsity, with all the benefits typically associated with this kind of regularization.

* * *

Sparsity is a well-known and desirable property of encodings: it forces the model to focus on the relevant features of the data, usually resulting in more robust representations that are less prone to overfitting.

Sparsity is typically achieved in neural networks by means of L1 regularizers, acting directly on the weights. Remarkably, the same behaviour is induced in Variational Autoencoders by the Kullback-Leibler divergence, simply acting on the variance of the encoding distribution Q(z|X).
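
As a point of comparison, here is a minimal sketch (in PyTorch, which we use for all code sketches in this post; the penalty coefficient is purely illustrative) of how L1 regularization on weights is typically added to a loss:

    import torch

    def l1_penalty(model: torch.nn.Module, lam: float = 1e-4) -> torch.Tensor:
        # Sum of the absolute values of all weights: pushes many of them to exactly 0.
        return lam * sum(p.abs().sum() for p in model.parameters())

    # During training the penalty is simply added to the task loss:
    #   loss = reconstruction_loss + l1_penalty(model)
    # In a VAE no such term is needed: the KL component alone induces sparsity,
    # acting on the variance of Q(z|X) rather than on the weights.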

The most interesting consequence is that, at least for a given architecture, there seems to exist an intrinsic internal dimension of the data. This property can be exploited both to understand whether the network has sufficient internal capacity (augmenting it until sparsity appears), and conversely to reduce the dimension of the network by removing links to unused neurons. Sparsity also helps to explain the loss of variability in random generative sampling from the latent space that one may sometimes observe with variational autoencoders.

In the following sections we shall investigate sparsity for a couple of typical test cases.

MNIST

We start with the well known MNIST dataset of handwritten digits that we already used in our previous article. Our first architecture is a dense network with dimensions 784-256-64-32-16.
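
For concreteness, a minimal PyTorch sketch of such a dense VAE might look as follows (layer sizes follow the 784-256-64-32-16 scheme above; activations and other details are our own assumptions, not necessarily those used in the actual experiments):

    import torch
    import torch.nn as nn

    class DenseVAE(nn.Module):
        def __init__(self, latent_dim: int = 16):
            super().__init__()
            # Encoder: 784 -> 256 -> 64 -> 32, then two heads for mean and log-variance
            self.encoder = nn.Sequential(
                nn.Linear(784, 256), nn.ReLU(),
                nn.Linear(256, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
            )
            self.mu = nn.Linear(32, latent_dim)       # z_X
            self.log_var = nn.Linear(32, latent_dim)  # log sigma^2(X)
            # Decoder: mirror of the encoder
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 32), nn.ReLU(),
                nn.Linear(32, 64), nn.ReLU(),
                nn.Linear(64, 256), nn.ReLU(),
                nn.Linear(256, 784), nn.Sigmoid(),
            )

        def forward(self, x):
            h = self.encoder(x)
            mu, log_var = self.mu(h), self.log_var(h)
            # Reparameterization: sample around z_X with the learned variance
            z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
            return self.decoder(z), mu, log_var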

In Figure 1 we show the evolution during a typical training of the variance of the 16 latent variables.

Figure 1: evolution of the variance during training (16 variables, MNIST case). On the x-axis, the number of minibatches (each of size 128).
The variance of some variables goes to zero (as expected), but for the other variables it goes to 1 instead. The latter variables will be neglected by the decoder.

Table 1 provides, for each latent variable at the end of training, two statistics computed over the full dataset: the variance of the variable (which we expect to be around 1, since latent variables should be normally distributed), and the mean of the variance σ2(X) computed by the network (which we expect to be a small value, close to 0). The mean of each variable is around 0, as expected, and we do not report it.

no    variance         mean(σ2(X))
 0    8.847272e-05     0.9802212      *
 1    0.00011756       0.99551463     *
 2    6.665453e-05     0.98517334     *
 3    0.97417927       0.008741336
 4    0.99131817       0.006186147
 5    1.0012343        0.010142518
 6    0.94563377       0.057169348
 7    0.00015841       0.98205334     *
 8    0.94694275       0.033207607
 9    0.00014789       0.98505586     *
10    1.0040375        0.018151345
11    0.98543876       0.023995731
12    0.000107441      0.9829797      *
13    4.5068125e-05    0.998983       *
14    0.0001085        0.9604088      *
15    0.9886378        0.044405878
Table 1: inactive variables (marked with *) in the 784-256-64-32-16 VAE for MNIST digits

All variables marked with an asterisk have an anomalous behavior: their variance over the dataset is very low (in practice, they always take value 0), while the variance σ2(X) computed by the network is around 1 for each X. In other words, the representation is getting sparse! Only 8 latent variables out of 16 are in use: the others are completely ignored by the generator.
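
These statistics are easy to reproduce. A possible sketch, assuming a model like the one above returning the reconstruction, mu and log_var, and a loader iterating over flattened MNIST images (the 0.1 threshold is an arbitrary choice of ours):

    import torch

    @torch.no_grad()
    def latent_statistics(model, loader, threshold=0.1):
        mus, sigma2s = [], []
        for x, _ in loader:
            _, mu, log_var = model(x.view(x.size(0), -1))
            mus.append(mu)
            sigma2s.append(log_var.exp())
        mu = torch.cat(mus)                  # shape (N, 16)
        sigma2 = torch.cat(sigma2s)
        emp_var = mu.var(dim=0)              # "variance" column: spread of z_X over the dataset
        mean_sigma2 = sigma2.mean(dim=0)     # "mean(sigma^2(X))" column
        inactive = emp_var < threshold       # low spread + sigma^2(X) ~ 1  =>  unused variable
        return emp_var, mean_sigma2, inactive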

As an additional confirmation of this fact, in Figure 2 we show a few digits randomly generated by Gaussian sampling in the latent space (upper row) and the result of generation when the inactive latent variables have been zeroed-out (lower row): they are indistinguishable.

Figure 2: Upper row: digits generated from a vector of 16 normally sampled latent variables. Lower row: the same digits generated after the inactive variables have been zeroed-out;
these latent variables are completely neglected by the generator.
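
The comparison in Figure 2 amounts to masking the inactive coordinates before decoding. A minimal sketch, reusing the inactive mask computed above (access to the decoder via model.decoder is an assumption consistent with our earlier sketch):

    import torch

    @torch.no_grad()
    def compare_generations(model, inactive, n=10, latent_dim=16):
        z = torch.randn(n, latent_dim)         # Gaussian sampling in the latent space
        full = model.decoder(z)                # upper row of Figure 2
        z_masked = z.clone()
        z_masked[:, inactive] = 0.0            # zero-out the inactive variables
        masked = model.decoder(z_masked)       # lower row: visually indistinguishable
        return full, masked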

Let’s now try with a convolutional VAE. We consider a relatively sophisticated network, with the structure shown in Figure 3.

Figure 3: architecture of the convolutional VAE
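
The exact architecture is the one depicted in Figure 3; purely as an illustrative stand-in, a convolutional encoder for this setting might look like the following (filter counts, kernel sizes and strides are our own guesses, not necessarily those of the actual network):

    import torch.nn as nn

    conv_encoder = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
        nn.ReLU(),
        nn.Flatten(),                                           # 64*7*7 = 3136 features
    )
    # Two linear heads (3136 -> 16) then produce mu and log_var, exactly as in the dense case;
    # the decoder mirrors this structure with transposed convolutions.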

In this case, sparsity is less evident, but still present: 3 variables out of 16 are inactive.

no    variance         mean(σ2(X))
 0    0.00015939       0.99321973     *
 1    0.91644579       0.015826659
 2    1.05478882       0.0062623904
 3    0.98102569       0.012602937
 4    1.07353293       0.0051363203
 5    1.06932497       0.0066873398
 6    6.477744e-05     0.96163213253  *
 7    1.11955034       0.0031915947
 8    0.88755643       0.024708110
 9    0.97943300       0.0094883628
10    0.9322856        0.016983853
11    1.40059826       0.0025208105
12    1.30227565       0.0033756110
13    0.00019337       0.99533605     *
14    1.13597583       0.0076088942
15    1.33482563       0.002084088
Table 2: inactive variables (marked with *) in the conv-VAE for MNIST digits

The reduced sparsity seems to suggest that convolutional networks exploit latent variables better, typically resulting in more precise reconstructions and improved generative sampling.
This is likely due to the fact that latent variables encode information corresponding to different portions of the input space, and are hence less likely to become useless for the generator.

The sparsity phenomenon was first highlighted by Burda et al. in this work, and later confirmed by many authors on several different datasets and neural architectures. We discuss this debated topic and survey the recent literature in this article, where we also investigate a few more concrete examples.

The degree of sparsity may slightly vary from training to training, but not in a significant way (at least, for similar final values of the loss function). This seems to suggest that, given a certain neural architecture, there exists an intrinsic, “optimal” compression of data in the latent space. If the network does not exhibit sparsity, it is probably a good idea to augment the dimension of the latent space; conversely, if the network is sparse, we may reduce its dimension by removing the inactive latent variables and their connections.

Kullback-Leibler divergence and sparsity

Let us consider the objective function for a data point X (the ELBO, a lower bound of its loglikelihood):

Ez∼Q(z|X) log P(X|z) − KL(Q(z|X)||P(z))

If we remove the Kullback-Leibler component from the previous objective function, or replace it with just the quadratic penalty on the latent variables, the sparsity phenomenon disappears. So, sparsity must be related to that part of the loss function.

It is also evident that if the generator ignores a latent variable, P(X|z) does not depend on it, and the loglikelihood is maximal when the distribution Q(z|X) is equal to the prior distribution P(z), which is just a normal distribution with mean 0 and standard deviation 1. In other words, the encoder is induced to learn a trivial encoding zX = 0 and a (fake) variance σ2(X) = 1. Sampling has no effect, since the sampled value for zX will just be ignored.
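
In code, the two components of the objective could be written as follows (the binary cross-entropy reconstruction term is an assumption that fits MNIST). Note that for a coordinate the decoder ignores, the reconstruction term is unaffected and the KL term alone is minimized exactly at zX = 0, σ2(X) = 1:

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_rec, mu, log_var):
        # Reconstruction term: -E_{z ~ Q(z|X)} log P(X|z), estimated with a single sample
        rec = F.binary_cross_entropy(x_rec, x, reduction="sum")
        # Closed-form KL(Q(z|X) || N(0,1)), summed over the latent coordinates
        kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)
        return rec + kl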

Intuitively, if during training a latent variable is of moderate interest for reconstructing the input (in comparison with the other variables), the network will learn to give less importance to it; at the end, the Kullback-Leibler divergence may prevail, pushing the mean towards 0 and the standard deviation towards 1. This will make the latent variable even more noisy, in a vicious loop that will eventually induce the network to completely ignore the latent variable.


About the Kullback-Leibler divergence in Variational Autoencoders

22/12/2018

In this article we shall try to provide an intuitive explanation of the Kullback-Leibler component in the objective function of Variational Autoencoders (VAEs). Some preliminary knowledge of VAEs may be required: see e.g. Doersch’s excellent tutorial for an introduction to the topic.

* * *

Variational Autoencoders (VAEs) are a fascinating variant of autoencoders that supports the random generation of new data samples. The log-likelihood log(P(X)) of a data sample X is approximated by a term known as the evidence lower bound (ELBO), defined as follows
 
Ez∼Q(z|X) log P(X|z) − KL(Q(z|X)||P(z))          (1)
 
where E denotes an expected value and KL(Q||P) is the Kullback-Leibler divergence of Q from P.

You should think of Q(z|X) as an encoder, mapping a data point X to a vector of latent random variables z, and of P(X|z) as a decoder, reconstructing the input given its encoding. P(z) is a prior distribution over the latent variables: typically, a normal distribution.

The first term in equation (1) is simply a measure of the distance of the reconstruction from the original: if P(X|z) has a Gaussian distribution around some decoder function d(z), its logarithm is a quadratic loss between X and d(zX), where zX is the encoding of X. The interesting point is that, at training time, instead of using precisely zX to reconstruct the input (as we would do in a traditional autoencoder), we sample around this point according to the (learned) distribution Q(z|X). Supposing Q(z|X) has a normal shape, learning the distribution means learning its moments: the mean value zX and the variance σ2(X).
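
Concretely, this sampling step is the usual reparameterization trick; a minimal sketch, assuming the encoder outputs zX as mu and log σ2(X) as log_var:

    import torch

    def sample_latent(mu, log_var):
        # z = z_X + sigma(X) * eps, with eps ~ N(0, 1): differentiable w.r.t. mu and log_var
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps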

Intuitively, you can imagine the region around zX of size σ2(X) as the portion of the latent space able to produce a reconstruction close to the original X. For this reason, we expect σ2(X) to be a small value: we do not want the encodings of different data points to overlap each other.

In this video, we show the trajectories followed in a bidimensional latent space by ten random digits of the MNIST dataset (one for each class) during the first epoch of training. The animation is summarized in Figure 1, where we use a fading effect to describe the evolution in time.

Figure 1: Trajectories followed in a bidimensional latent space by ten MNIST digits during the first training epoch.

For each digit, the area of the circle represents the variance computed by the VAE (more precisely, r2 is the geometric average of the two variances along x and y). Initially it is close to 1, but it rapidly decreases to a very small size; this is not surprising, since we need to make room for 60000 different digits! Also, observe that all digits progressively distribute around the center, in a Gaussian-like fashion. The Gaussian distribution is better appreciated in Figure 2, where we describe the position and “size” of 60 digits (6 for each class) after 10 training epochs.

Figure 2: position and variance of 60 MNIST digits after 10 epochs of training. Observe the Gaussian-like distribution and, especially, the really small values of the variance

 

Let us also remark on the really small values of the variance σ2(X) for all data X (expressed by the area of the circles). Actually, they would decrease even further as training proceeds.

The puzzling nature of the Kullback-Leibler term

The questions we shall try to answer are the following:

  1. Why are we interested in learning the variance σ2(X)? Note that this variance is not used during the generation of new samples, since in that case we sample from the prior normal distribution (we do not have any X!). The variance σ2(X) is only used for sampling during training, but even the relevance of such an operation (apart, possibly, from improving the robustness of the generator) is not evident, especially since, as we have seen, σ2(X) is typically very small. As a matter of fact, the main operational purpose of sampling during training is precisely to learn the actual value of σ2(X), which takes us back to the original question: why are we interested in learning σ2(X)?
  2. The purpose of the Kullback-Leibler component in the objective function is to bring the distribution Q(z|X) close to a normal G(0,1) distribution. That sounds crazy: if Q(z|X) were the same normal distribution for every X, we would have no way to distinguish the different inputs in the latent space. We may understand that we try to keep the mean value zX close to 0 with some quadratic penalty, in order to achieve the expected normal distribution of latent variables (needed for generative sampling), but why are we trying to keep σ2(X) close to 1? If, for a pair of different inputs X’ and X”, the corresponding Gaussians Q(z|X’) = G(zX’, σ2(X’)) and Q(z|X”) = G(zX”, σ2(X”)) overlap too much, we would have no practical way to distinguish the two points. The mean values zX’ and zX” cannot be too far away from each other, since we expect them to be normally distributed around 0, so we eventually expect the variance σ2(X) to be really small for any X (close to 0), which is what happens in practice. But if this is the expected behavior, why is keeping σ2(X) close to 1 part of our learning objective?

The kind of answers we are looking for are not on a theoretical level: the mathematics behind variational autoencoders is neat, although not very intuitive. Our purpose is to obtain some empirical evidence that could help us better grasp the underlying theory.

A closer look at the Kullback-Leibler component

Before addressing the previous questions, let’s have a closer look at the Kullback-Leibler divergence in equation (1). Supposing that Q(z|X) has a Gaussian shape G(zX,σ2(X)) and the prior P(z) is a normal G(0,1) distribution, we can compute it in closed form:

KL(G(zX2(X))||G(0,1)) = 1/2(zX2 + σ2(X) – log(σ2(X)) -1)       (2)

The term zX2 is a quadratic penalty over encodings, meant to keep them around 0. The second part, σ2(X) – log(σ2(X)) – 1, pushes σ2(X) towards the value 1. Note that if we removed this part, sampling during training would lose any interest: if σ2(X) is not counteracted in any way, it tends to go to 0, the distribution Q(z|X) collapses to a Dirac distribution around zX, and sampling becomes pointless.
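
To make the last point explicit, consider the second part of (2) as a function of σ2(X): its derivative is

1 – 1/σ2(X)

which vanishes exactly at σ2(X) = 1, while the expression itself diverges as σ2(X) approaches 0. The KL term thus has a built-in preference for unit variance, counteracting the collapse.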

So, mathematically, sampling at training time allows us to estimate σ2(X), and we are interested in computing σ2(X) because of its usage in equation (2), where we counteract the natural tendency of σ2(X) to collapse to 0 by pushing it towards 1. What remains to be understood is the actual purpose of this operation.

Take up as much space as you deserve

The practical purpose of the previous mechanism is to induce each data point X to occupy as much of the latent space as it deserves, compatibly with the similar requirement of the other points. How much space can it take? In principle, all the available space: that is precisely the reason why we try to keep the variance σ2(X) close to 1.

In practice, you force each data point X to compete with all the other points for the occupancy of the latent space, each one pushing the others as far away as possible. This operation should hopefully result in a better coverage of the latent space, which should in turn produce a better generation of new samples.

Let’s try to get some empirical evidence of this fact.

In the case of MNIST, we start getting significant empirical evidence when considering a sufficiently deep architecture and a latent space of dimension 3 (with 2 dimensions it is difficult to appreciate the difference). In Figure 3, we show the final distribution of 5000 MNIST digits in a 3-dimensional latent space, with and without sampling during training (in the case without sampling, we keep the quadratic penalty on zX). We also show the result of generative sampling from the latent space, organized in five horizontal slices of 25 points each. For this example we used a dense 784-256-64-16-3 architecture.
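
In our sketches, the “without sampling” variant amounts to replacing the sampled z with the mean during training (while keeping the quadratic penalty on zX in the loss); the switch is as small as this:

    import torch

    def encode_for_training(mu, log_var, use_sampling=True):
        # use_sampling=False turns the VAE into a deterministic autoencoder:
        # we keep only the quadratic penalty on z_X in the loss
        if not use_sampling:
            return mu
        return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)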

Figure 3: Distribution of 5000 MNIST digits in a 3D latent space,
with sampling at training time (left) and without it (right). 

We may observe that sampling during training induces a much more regular disposition of points in the latent space. In turn, this results in a drastic improvement in the quality of randomly generated images.

Does this scale to higher dimensions? Is the Kullback-Leibler component really enough to induce a good coverage of the latent space in view of generative sampling?

Apparently yes, to a good extent. But in higher dimensions there is an even more interesting effect produced by the Kullback-Leibler divergence, which we shall discuss in the next article: the latent representation becomes sparse!