The large umbrella of Artificial Intelligence

It is becoming increasingly frequent to see introductions to deep learning where the subject is presented as a subtopic of Machine Learning, which is in turn described as a particular field of Artificial Intelligence.

While this presentation can indubitably have some didactic value, in particular for explaining which components of a software algorithm the machine is supposed to learn, from a more historical perspective it is quite confusing, and it does not help in understanding the mutual relations between the different fields and their evolution.

First of all, it is important to understand that Artificial Intelligence has not always been as popular as it is at present but, similarly to most scientific topics, has traversed periods of mixed fortunes, alternating waves of enthusiasm and disaffection. In particular, there are a couple of exceptionally gloomy periods, known as AI winters, during the years 1974–1980 and 1987–1993, the first following the famous Lighthill report, and the second a consequence of the failure of the “fifth-generation computers” project.

As is normal, during favourable periods many research fields tend to rally under the highly visible banner of AI, while in gloomy days they tend to mark out and confine their territory, drawing distinctions and entrenching themselves in specific, highly specialized topics and clearly identifiable methodologies.

The situation at the beginning of the century

Let’s have a look at the state of the discipline at the beginning of the century.

AI is still slowly recovering from its second winter, also suffering from the competition of the emerging (and separate) field of Machine Learning (more on this later). Traditional AI is dominated by the historical topics: knowledge representation, expert systems and (constraint) logic programming. Neural Networks are indeed a part of AI, but play an extremely marginal role. We are still in an epoch of shallow networks, with most of the attention focused on recurrent NNs and the recent LSTM models; networks are slow, difficult to train, and not particularly effective. The general perception is that NNs are a dying topic, with no prospects. In addition, due to their biological inspiration, they are looked at with particular suspicion by the majority of AI researchers, in view of the ongoing and somewhat surreal discussion about “strong” versus “weak” AI.

The lively, emerging field is Machine Learning. Here, the trendy topic is Support Vector Machines (SVMs), followed, at great distance, by Bayesian models. Machine Learning is not interested in presenting itself as a subfield of AI; on the contrary, it tries to emphasize its distinctive methodologies and its more solid, scientific background. It is instructive to see that in the book “Pattern Recognition and Machine Learning” by Bishop, one of the pillars of the discipline, there is not a single mention of Artificial Intelligence (except for the names of a few conferences in the references). In this case too, Neural Networks are a subfield of ML (in Bishop’s book there is a full chapter on these models): they share terminology, methodologies and relevant techniques with ML. However, again, their role is absolutely marginal, and shallow networks are systematically outperformed by other techniques.

AI, ML and (shallow) Neural Networks at the beginning of the century

The two communities of AI and ML ignored (read: cordially detested) each other; moreover, they both barely tolerated people working on Neural Networks, both for the reasons already mentioned and for their innocent habit of “keeping a foot in both worlds”.

A myriad of specialized topics

In addition to the above areas of research, further specialization in particular domains of application – vision, natural language processing, optical character recognition, speech comprehension, data mining, decision theory, robotics, … – contributed to fragmenting AI into a myriad of sub-fields that had essentially nothing to say to each other, and no interest in exchanging knowledge.

Fragmentation of AI/ML research around the year 2000 (only major fields are reported)

The case of Natural Language Processing (NLP) is paradigmatic. During the sixties, NLP was one of the main areas of application of AI, aggressively supported by governments, mostly interested in the potential military applications of this line of research. However, results were modest and in 1964 the National Research Council created a commission – the Automatic Language Processing Advisory Committee (ALPAC) – to investigate the problem. The famous report, delivered in 1966, was extremely negative about the prospects of the field.

While the ALPAC report caused the end of direct NRC financial support, more money arrived through the Defense Advanced Research Projects Agency (DARPA, previously known as ARPA). Another famous failure is the Speech Understanding Research program at Carnegie Mellon University (CMU), which generated progressive frustration at DARPA (1971–74) and finally resulted in the cancellation of an annual grant of three million dollars (1974).

After the first winter of AI, the paths of AI and NLP start to diverge, giving rise to distinct communities with little or no communication between them. In the ’80s, NLP is mostly dominated by symbolic methods, but during the period between 1990 and 2010 there is a progressive adoption of machine learning algorithms, giving rise to so-called Statistical NLP. With the advent of the web at the beginning of the new century, increasing amounts of raw (unannotated) language data finally become available, favouring the development of statistical approaches and particularly stimulating research on unsupervised and semi-supervised learning algorithms.

The advent of Deep Learning

What changed everything was the advent of Deep Learning.

Research on deep neural networks had been going on for years; however, the starting date of the Deep Learning “revolution” is usually fixed at 2012, when a number of unexpected and remarkable results refocused the interest of many different communities on Neural Networks. An emblematic event is the ImageNet competition of October 2012, where the so-called AlexNet, a deep convolutional NN by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, won the challenge by a significant margin over shallow machine learning methods. According to many authors, the ImageNet victory of 2012 is a sort of landmark for the new era of “deep learning”.

Since then, Deep Learning techniques have rapidly swallowed up many of the fields that had previously departed from AI, establishing themselves as a unifying and comprehensive framework.

The Deep Learning Revolution in AI

To give an example, since around 2015 NLP has essentially abandoned statistical methods in favour of Deep Neural Networks. This shift entailed substantial changes in the design of NLP systems, usually characterized by end-to-end learning of high-level tasks, in contrast with the typical pipeline of statistical techniques. For instance, in Neural Machine Translation (NMT) the network is directly trained to learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that were used in statistical machine translation.

Computer vision is another field where Deep Learning algorithms have completely revolutionized the state-of-the-art. Object detection, semantic segmentation, face recognition, image denoising, super-resolution, 3D shaping, are just some of the many research areas where deep neural networks have replaced traditional techniques. In many cases, Deep Learning is proposing innovative challenges, as in the case of “panoptic” segmentation, or the key-point detection task for pose estimation.

Robotics too is currently dominated by Deep Reinforcement Learning (DRL) algorithms. In this case, the use of Deep Neural Networks makes it possible to address the scalability problem of most traditional RL techniques, avoiding the explicit modelling of the state space and using Neural Networks as function approximators for the relevant functions.

Conclusions

In conclusion, presenting DL as a subfield of ML, which is in turn a subfield of AI, is somewhat reductive and does not fully reflect the complexity and breadth of the phenomenon. Deep Learning, being based on neural networks, is indeed a historical component of AI (unlike ML). On the other hand, its techniques are surely closer to ML than to other fields of AI. In any case, DL is the real novelty of the renewed AI, and its beating heart. When you read news about new AI achievements in newspapers or other media, in the large majority of cases those results have been obtained by deploying Deep Learning techniques. The diffusion of DL is pervasive, and for the moment the trend shows no signs of slowing down.

Loss of Variance in Variational Autoencoders

Variational Autoencoders (VAEs), in comparison with alternative generative techniques, usually produce images with a characteristic and annoying blurriness.

A possible explanation can be given in terms of a related problem, which really seems to be another facet of the same phenomenon: the variance of generated data is significantly lower than that of the training data.

This problem of VAEs is not widely known, which is precisely the reason why I decided to write a post on this topic.

First of all, what variance are we interested in?

Typically, the output of the network is a tensor of shape B\times D, where B is the batch size and D is the dimension of the output. To fix ideas, let us suppose that the output is a coloured image, so that D=W\times H\times 3. We want to measure the variance of each single pixel along the batch axis, that is, how the pixel varies over the data under consideration. Then, we take the mean of all the variances obtained in this way.

Working with numpy, the formula we are interested in is

np.mean(np.var(x, axis=0))

Now, we compute this expression on a large number of real data (e.g. the validation set) and on an equal number of reconstructed (or generated) images, and take the difference between the two:

\begin{array}{rl}\Delta_{var} = &np.mean(np.var(x, axis=0))\\ & -np.mean(np.var(\hat{x}, axis=0))\end{array}

You can take your favourite VAE and perform the above simple computation. We would expect the two values to be approximately equal. In particular, there would be no reason to expect the variance of one of the two sets to be systematically higher or lower than the other: the difference could be positive as well as negative.

But this is not the case. As you will discover, the difference is positive and not negligible, which precisely means that we are witnessing a loss of variance in the generated data.

The magnitude of the variance loss

Can we estimate the amount of this loss? Experimentally, it turns out to be approximately equal to the mean squared error between real and reconstructed images:

mse = np.mean(np.square(x - \hat{x}))

Note that, contrary to the previous formula, mse is positive by definition.
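For concreteness, here is a minimal numpy sketch of both measurements, assuming x and x_hat are arrays of real and reconstructed (or generated) images of shape (B, W, H, 3); the names are placeholders for your own data.

import numpy as np

def variance_gap(x, x_hat):
    # mean per-pixel variance of real data minus that of reconstructions,
    # together with the mean squared error between the two sets
    delta_var = np.mean(np.var(x, axis=0)) - np.mean(np.var(x_hat, axis=0))
    mse = np.mean(np.square(x - x_hat))
    return delta_var, mse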

As a working example, consider the simple variational autoencoder for MNIST defined by F. Chollet in this post: Building Autoencoders in Keras.

In this case (after 50 epochs of training), we have

\begin{array}{rl} mse = & 0.0413 \\ \Delta_{var} = & 0.0411 \end{array}

A lucky coincidence, you might say. Well, quoting master Kenobi, let me say that, in my experience, there is no such thing as luck.

Since I conjectured the above relation, I have systematically measured it on all the VAEs I had the opportunity to work with. In a couple of years I collected a fairly large amount of data, summarized in the Figure below.

Relation between mean squared error and variance loss.
The distribution is close to the diagonal.

The different colors refer to different neural architectures: blue = Dense Networks; red = ResNet-like; green = Convolutional Networks; yellow = Iterative Networks (DRAW, GQN-like). The point is not to emphasize the difference between the various architectures (that would require a more extensive evaluation), but to remark that the relation between \Delta_{var} and mse is systematic, and largely independent of the specific architecture.

Of course, if the variance loss is of the same magnitude as the mean squared error, reducing the reconstruction error may also implicitly solve the variance issue. However, the log-likelihood objective, in conjunction with the underlying averaging due to dimensionality reduction and the variational approach, really seems to be the cause of the loss, so merely trying to improve reconstruction may not be the right approach.

Intuitively, VAEs are, by their nature, extremely prudent in their predictions (at the very opposite of GANs). Even when generated samples are really good, as in the case of the recent NVAE, they still have a glossy look and an extremely conventional appearance.

What we would like is to find a way to force a bit more of “verve” in their behaviour, increasing the “temperature” of generated samples.

If you are interested, you may find additional details on this topic in this article:

Andrea Asperti, Variance Loss in Variational Autoencoders. Sixth International Conference on Machine Learning, Optimization, and Data Science. July 19-23, 2020 – Certosa di Pontignano, Siena, Italy

Calibrating VAEs

I recently obtained the best generative results ever achieved for Variational Autoencoders.

Below is a sample of the kind of faces I was able to generate using CelebA as training set, with a moderate use of computing resources:

The result is based on two different insights, which I will briefly discuss in the following sections:

  • a new balancing policy between reconstruction error and Kullback-Leibler divergence in the VAE loss function
  • a renormalization operation for two-stage VAEs, compensating for the loss of variance typical of (Variational) Autoencoders.

Balancing reconstruction error and KL-divergence

The loss function of Variational autoencoders has the following shape:

\underbrace{-\,\mathbb{E}_{z \sim Q(z|X)}\log(P(X|z))}_{\mbox{reconstruction error}} + \underbrace{KL(Q(z|X)||P(z))}_{\mbox{KL-divergence}}

The first component is meant to enhance the quality of reconstructed images, while the second component acts as a regularizer of the latent space, pushing it towards the prior distribution P(z). The two components have contrasting effects, and their balance is a delicate issue.

Our solution consists in keeping a constant balance between the two components throughout training. Supposing P(X|z) has a Gaussian shape, the (negative) log-likelihood is just the mean squared error between the reconstructed image \hat{X} and the original sample X. During training, we normalize this component by an estimate of the current error, computed over minibatches. In this way, the KL component cannot easily prevail, which would prevent any further improvement in the quality of reconstructions.
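As a rough illustration (not the actual implementation of the paper), the arithmetic of the balancing policy could look as follows in numpy, where mse_estimate is a running estimate of the reconstruction error kept over minibatches; all names are illustrative.

import numpy as np

def balanced_vae_loss(x, x_hat, mu, log_var, mse_estimate):
    # the Gaussian (negative) log-likelihood reduces to a mean squared error,
    # which is normalized by the running estimate of the current error
    mse = np.mean(np.square(x - x_hat))
    kl = 0.5 * np.mean(np.sum(np.square(mu) + np.exp(log_var) - log_var - 1, axis=-1))
    return mse / mse_estimate + kl

# the estimate itself can be updated over minibatches, e.g.
# mse_estimate = 0.99 * mse_estimate + 0.01 * np.mean(np.square(x - x_hat))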

More details can be found in this article:
Andrea Asperti, Matteo Trentin, Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders, IEEE Access, vol. 8, pp. 199440-199448, 2020, doi: 10.1109/ACCESS.2020.3034828.

Variance Loss and renormalization

Data generated by Variational Autoencoders seem to suffer from a systematic loss of variance with respect to the training set. The phenomenon is likely due to averaging, which in turn could be caused by dimensionality reduction (as in the case of PCA), or by sampling in the latent space, typical of the variational approach.

This is relevant, since generative models are evaluated with metrics such as the Fréchet Inception Distance (FID), which precisely compare the distributions of (features of) real versus generated images.

The variance loss becomes particularly dangerous in a two-stage setting, where a second VAE is used to sample in the latent space of the first VAE. The reduced variance creates a mismatch between the actual distribution of latent variables and those generated by the second VAE, which hinders the beneficial effects of the second stage.

A simple solution is just to renormalize the output of the second VAE towards the expected spherical normal distribution (or better, towards the moments of this distribution, as computed e.g. by the variance law discussed in one of my previous posts).
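A minimal numpy sketch of the renormalization, assuming z holds the latent samples produced by the second stage, with shape (N, latent_dim), and that the target is the spherical normal prior (function name and defaults are illustrative):

import numpy as np

def renormalize_latents(z, target_mean=0.0, target_std=1.0):
    # match the per-dimension moments of z to those of the expected prior
    mean = np.mean(z, axis=0, keepdims=True)
    std = np.std(z, axis=0, keepdims=True)
    return target_mean + target_std * (z - mean) / (std + 1e-8)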

This simple operation typically results in a noticeable improvement in the perceptual quality of generated images, and a remarkable boost in terms of FID.

In the image below, you see the effect of our technique on images generated from the same seed: on the left you have the original image, and on the right the result of the latent space renormalization.

More details can be found in this article:
Andrea Asperti, Variance Loss in Variational Autoencoders, arXiv preprint arXiv:2002.09860 (2020).

A stationary condition for Variational Autoencoders

14/2/2019

In this article, we continue our investigation of Variational Autoencoders (see our previous posts on the regularization effect of the Kullback-Leibler divergence, and the sparsity phenomenon). In particular, we shall point out an interesting stationary condition induced by the Kullback-Leibler component of the objective function.


Let us first of all observe that trying to compute relevant statistics for the posterior distribution Q(z|X) of latent variables without some kind of regularization constraint does not make much sense. As a matter of fact, given a network with mean \mu_z(X) and variance \sigma_z^2(X) for a given latent variable z, we can easily build another one having precisely the same behavior by scaling mean and standard deviation by some constant γ (for all data, uniformly), and then downscaling the generated samples in the next layer of the network. These kinds of linear transformations are easily performed by any neural network (it is the same reason why it does not make much sense to add a batch-normalization layer before a linear layer).

Let’s see how the KL-divergence helps to choose a solution. In the following, we suppose to work on a specific latent variable z, omitting to mention it explicitly. Starting from the assumption that for a network it is easy to keep a fixed ratio \rho^2(X) = \sigma^2(X)/\mu^2(X), we can plug this value into the closed form of the Kullback-Leibler divergence, that is

{\mathit{KL}(G(\mu(X),\sigma^2(X))||G(0,1))= (\mu(X)^2+\sigma^2(X) -\mathit{log}(\sigma^2(X)) -1) /2}

getting the following expression:

(\sigma^2(X)\frac{1 + \rho^2(X)}{\rho^2(X)}-log(\sigma^2(X))-1)/2  \hspace{1cm}(1)

In Figure 1, we plot the previous function in terms of the variance, for a few given values of \rho.

KL-divergence for different values of \rho:
observe the strong minimum for small values of \rho.

The above function has a minimum for

{\sigma^2(X) = \frac{\rho^2(X)}{1+\rho^2(X)}}\hspace{1cm}(2)

close to 0 when \rho is small, and close to 1 when \rho is high. Of course \rho depends on X, while the rescaling operation after sampling must be uniform; still, the network will have a propensity to synthesize variances close to \frac{\rho^2(X)}{1+\rho^2(X)}, which lies between 0 and 1 (below we shall average over all X).
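A quick numerical sanity check of minimum (2), for an arbitrary fixed value of \rho^2:

import numpy as np

rho2 = 0.25                              # arbitrary fixed value of rho^2
sigma2 = np.linspace(0.01, 2.0, 100000)  # grid of candidate variances
kl = 0.5 * (sigma2 * (1 + rho2) / rho2 - np.log(sigma2) - 1)
print(sigma2[np.argmin(kl)])             # ~0.2, i.e. rho^2 / (1 + rho^2)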

Substituting the definition of \rho^2(X) in equation (2), we expect to reach a minimum when \sigma^2(X) = \frac{\sigma^2(X)}{\mu^2(X)}\frac{\mu^2(X)}{\mu^2(X)+\sigma^2(X)}, which, by trivial computations, implies the following simple stationary condition:

\sigma^2(X) + \mu^2(X) = 1

Let us now average together the KL components for all data X:

\frac{1}{N}\sum_X \frac{1}{2}(\mu(X)^2 + \sigma^2(X)-log(\sigma^2(X)) -1)

We use the notation \widehat{f(X)} to abbreviate the average \frac{1}{N}\sum_X f(X) of f(X) on all data X.
The ratio \rho^2 = \frac{\widehat{\sigma^2(X)}}{\widehat{\mu^2(X)}} can really (and easily) be kept constant by the net. Let us also observe that, assuming the mean of the latent variable to be 0, \widehat{\mu^2(X)} is just the (global) variance \sigma^2 of the latent variable.

Plugging \rho^2 into the previous equation, we get

{\frac{1}{2}(\widehat{\sigma^2(X)}\frac{1 + \rho^2}{\rho^2}-\widehat{log(\sigma^2(X))} -1)}

Now we perform a somewhat rough approximation. The average of the logarithms \widehat{log(\sigma^2(X))} is the logarithm of the geometric mean of the variances. If we replace the geometric mean with an arithmetic mean, we get an expression essentially equivalent to expression (1), namely

{\frac{1}{2}(\widehat{\sigma^2(X)}\frac{1 + \rho^2}{\rho^2}-log(\widehat{\sigma^2(X)}) -1)}

that has a minimum when

\widehat{\sigma^2(X)} = \frac{\rho^2}{1+\rho^2}

that implies

\widehat{\sigma^2(X)} + \widehat{\mu^2(X)} = 1

or simply,

\widehat{\sigma^2(X)} + \sigma^2 = 1\hspace{1cm}(3)

where we replaced \widehat{\mu^2(X)} with the variance \sigma^2 of the latent variable in view of the consideration above.

Condition (3) can be experimentally verified. In spite of the rough approximation we used to derive it, it proves to be quite accurate, provided the Variational Autoencoder is sufficiently trained. You can check it on your own experiments, or compare it with the data provided in our previous posts.
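For instance, here is a sketch of the check in numpy, assuming mu and log_var are the encoder outputs collected over a dataset, both of shape (N, latent_dim); the names are illustrative.

import numpy as np

def check_stationary_condition(mu, log_var):
    # average inferred variance plus variance of the inferred means,
    # per latent variable: condition (3) predicts values close to 1
    mean_sigma2 = np.mean(np.exp(log_var), axis=0)
    var_mu = np.var(mu, axis=0)
    return mean_sigma2 + var_mu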

Let us finally remark that condition (3) is supposed to hold both for active and inactive variables.

Sparsity in Variational Autoencoders

In our previous article, we discussed the regularization effect of the Kullback-Leibler divergence in the objective function of Variational Autoencoders, providing empirical evidence that it results in a better coverage of the latent space.

In this article, we shall discuss another important effect of it: when working in latent spaces of sufficiently high dimension, the latent representation becomes sparse. Many latent variables are zeroed out (independently of the input), the associated variance computed by the network is around one (while the real variance is close to 0), and in any case those variables are neglected by the decoder.

This property is usually known under the name of over-pruning, since it induces the model to use only a small number of its stochastic units. In fact, this is a form of sparsity, with all the benefits typically associated with this form of regularization.

* * *

Sparsity is a well known and desirable property of encodings: it forces the model to focus on the relevant features of data, usually resulting in more robust representations, less prone to overfitting. 

Sparsity is typically achieved in neural networks by means of L1 regularizers, directly acting on weights. Remarkably, the same behaviour is induced in Variational Autoencoders by the Kullback-Leibler divergence, simply acting on the variance of the encoding distribution Q(z|X).

The most interesting consequence is that, at least for a given architecture, there seems to exist an intrinsic internal dimension of the data. This property can be exploited either to understand whether the network has sufficient internal capacity, augmenting it until sparsity appears, or conversely to reduce the dimension of the network by removing links to unused neurons. Sparsity also helps to explain the loss of variability in random generative sampling from the latent space that one may sometimes observe with Variational Autoencoders.

In the following sections we shall investigate sparsity for a couple of typical test cases.

MNIST

We start with the well known MNIST dataset of handwritten digits that we already used in our previous article. Our first architecture is a dense network with dimensions 784-256-64-32-16.

In Figure 1 we show the evolution during a typical training of the variance of the 16 latent variables.

Figure 1: evolution of the variance along training (16 variables, MNIST case). On the x-axis we have the number of minibatches, each of size 128.
The variance of some variables goes to zero (as expected), but for other variables it goes to 1 instead. The latter variables will be neglected by the decoder.

Table 1 provides relevant statistics for each latent variable at the end of training, computed over the full dataset: its variance (which we expect to be around 1, since the variable should be normally distributed), and the mean of the computed variance σ2(X) (which we expect to be a small value, close to 0). The mean of each variable is around 0, as expected, and we do not report it.

no   variance        mean(σ2(X))
0    8.847272e-05    0.9802212     (*)
1    0.00011756      0.99551463    (*)
2    6.665453e-05    0.98517334    (*)
3    0.97417927      0.008741336
4    0.99131817      0.006186147
5    1.0012343       0.010142518
6    0.94563377      0.057169348
7    0.00015841      0.98205334    (*)
8    0.94694275      0.033207607
9    0.00014789      0.98505586    (*)
10   1.0040375       0.018151345
11   0.98543876      0.023995731
12   0.000107441     0.9829797     (*)
13   4.5068125e-05   0.998983      (*)
14   0.0001085       0.9604088     (*)
15   0.9886378       0.044405878
Table 1: inactive variables (marked with *) in the 784-256-64-32-16 VAE for MNIST digits

All variables marked with (*) have an anomalous behavior: their variance is very low (in practice, they always have value 0), while the variance σ2(X) computed by the network is around 1 for each X. In other words, the representation is getting sparse! Only 8 latent variables out of 16 are in use: the others are completely ignored by the generator.
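The statistics of Table 1 can be computed with a few lines of numpy, assuming mu and log_var are the encoder outputs over the full dataset, with shape (N, 16); names and threshold are illustrative.

import numpy as np

def latent_statistics(mu, log_var, threshold=0.1):
    # variance of each latent variable over the data, and mean of the
    # variance sigma^2(X) computed by the network; variables with
    # near-zero variance (and sigma^2(X) close to 1) are inactive
    variance = np.var(mu, axis=0)
    mean_sigma2 = np.mean(np.exp(log_var), axis=0)
    inactive = variance < threshold
    return variance, mean_sigma2, inactive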

As an additional confirmation of this fact, in Figure 2 we show a few digits randomly generated from Gaussian sampling in the latent space (upper line) and the result of generation when inactive latent variables have been zeroed-out (lower line): they are indistinguishable.

Figure 2: Upper line: digits generated from a vector of 16 normally sampled latent variables. Lower line: digits generated after the inactive variables have been zeroed out;
these latent variables are completely neglected by the generator.
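The experiment of Figure 2 amounts to masking the inactive variables before decoding; a possible sketch, assuming a Keras-like decoder and the boolean mask inactive computed as above (all names are illustrative):

import numpy as np

def generate_with_masked_latents(decoder, inactive, n=10, latent_dim=16):
    # sample from the prior, zero out the inactive variables, and decode both
    z = np.random.normal(size=(n, latent_dim))
    z_masked = z.copy()
    z_masked[:, inactive] = 0.0
    return decoder.predict(z), decoder.predict(z_masked)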

Let’s try with a convolutional VAE. We consider a relatively sophisticated network, with the structure shown in Figure 3.

Figure 3: architecture of the convolutional VAE

In this case, sparsity is less evident, but still present: 3 variables out of 16 are inactive.

no   variance        mean(σ2(X))
0    0.00015939      0.99321973      (*)
1    0.91644579      0.15826659
2    1.05478882      0.0062623904
3    0.98102569      0.012602937
4    1.07353293      0.0051363203
5    1.06932497      0.0066873398
6    6.477744e-05    0.96163213253   (*)
7    1.11955034      0.0031915947
8    0.88755643      0.024708110
9    0.97943300      0.0094883628
10   0.9322856       0.016983853
11   1.40059826      0.0025208105
12   1.30227565      0.0033756110
13   0.00019337      0.99533605      (*)
14   1.13597583      0.0076088942
15   1.33482563      0.002084088
Table 2: inactive variables (marked with *) in the conv-VAE for MNIST digits

Having less sparsity seems to suggest that convolutional networks make better use of latent variables, typically resulting in a more precise reconstruction and improved generative sampling.
This is likely due to the fact that latent variables encode information corresponding to different portions of the input space, and are less likely to become useless for the generator.

The sparsity phenomenon was first highlighted by Burda et al. in this work, and later confirmed by many authors on several different datasets and neural architectures. We discuss this debated topic and survey the recent literature in this article, where we also investigate a few more concrete examples.

The degree of sparsity may vary slightly from training to training, but not in a significant way (at least, for similar final values of the loss function). This seems to suggest that, given a certain neural architecture, there exists an intrinsic, “optimal” compression of data in the latent space. If the network does not exhibit sparsity, it is probably a good idea to augment the dimension of the latent space; conversely, if the network is sparse, we may reduce its dimension by removing inactive latent variables and their connections.

Kullback-Leibler divergence and sparsity

Let us consider the log-likelihood for data X:

E_{z∼Q(z|X)} log P(X|z) − KL(Q(z|X)||P(z))

If we remove the Kullback-Leibler component from the previous objective function, or keep just the quadratic penalty on latent variables, the sparsity phenomenon disappears. So, sparsity must be related to that part of the loss function.

It is also evident that if the generator ignores a latent variable, P(X|z) will not depend on it, and the log-likelihood is maximal when the distribution Q(z|X) is equal to the prior distribution P(z), which is just a normal distribution with mean 0 and standard deviation 1. In other words, the network is induced to learn a trivial encoding zX = 0 and a (fake) variance σ2(X) = 1. Sampling has no effect, since the sampled value for zX will just be ignored.

Intuitively, if during training a latent variable is of moderate interest for reconstructing the input (in comparison with the other variables), the network will learn to give less importance to it; at the end, the Kullback-Leibler divergence may prevail, pushing the mean towards 0 and the standard deviation towards 1. This will make the latent variable even more noisy, in a vicious loop that will eventually induce the network to completely ignore the latent variable.

We can get some empirical evidence of the previous phenomenon by artificially deteriorating the quality of a specific latent variable.
In Figure 4, we show the evolution during training of one of the active variables of the variational autoencoder of Table 1, subject to a progressive addition of Gaussian noise. During the experiment, we force the variables that were already inactive to remain so, since otherwise the network would compensate for the deterioration of a new variable by revitalizing one of the dead ones.

Figure 4: Evolution of the reconstruction gain and KL-divergence of a latent variable
during training, while acting on its quality by the addition of Gaussian noise. We also show
in the same picture the evolution of the variance, to compare their progress.

In order to evaluate the contribution of the variable to the loss function, we compute the difference between the reconstruction error when the latent variable is zeroed out and the reconstruction error when it is normally taken into account; we call this quantity the reconstruction gain.
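In code, the reconstruction gain of a variable could be estimated as follows, assuming a Keras-like decoder and the encoded means z_mean of a batch x (names are illustrative):

import numpy as np

def reconstruction_gain(decoder, x, z_mean, var_index):
    # increase of the reconstruction error when variable var_index is zeroed out
    err_full = np.mean(np.square(x - decoder.predict(z_mean)))
    z_masked = z_mean.copy()
    z_masked[:, var_index] = 0.0
    err_masked = np.mean(np.square(x - decoder.predict(z_masked)))
    return err_masked - err_full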

After each increment of the Gaussian noise we repeat one epoch of training, to allow the network to suitably reconfigure itself. In this particular case, the network reacts to the Gaussian noise by enlarging the mean values \mu_z(X) of the posterior distribution Q(z|X), in an attempt to escape from the noisy region, but jointly increasing the KL-divergence. At some point, the reconstruction gain of the variable becomes smaller than the KL-divergence; at that point we stop incrementing the noise. Here, we witness the sparsity phenomenon: the KL-term suddenly pushes the variance towards 1, decreasing the KL-divergence, but also causing a sudden and catastrophic collapse of the reconstruction gain of the latent variable.

Contrary to what is frequently believed, sparsity seems to be reversible, to some extent. If we remove the noise from the variable, as soon as the network is able to perceive some potential in it (which may take several epochs, as is evident in Figure 4), it will eventually make suitable use of it. Of course, we should not expect to recover the original reconstruction gain, since the network may have meanwhile learned a different repartition of roles among the latent variables.

About the Kullback-Leibler divergence in Variational Autoencoders

22/12/2018

In this article we shall try to provide an intuitive explanation of the Kullback-Leibler component in the objective function of Variational Autoencoders (VAEs). Some preliminary knowledge of VAEs could be required: see e.g. Doersch’s excellent tutorial for an introduction to the topic.

* * *

Variational Autoencoders (VAEs) are a fascinating variant of autoencoders, supporting the random generation of new data samples. The log-likelihood log(P(X)) of a data sample X is approximated by a quantity known as the evidence lower bound (ELBO), defined as follows:
 
E_{z∼Q(z|X)} log P(X|z) − KL(Q(z|X)||P(z))          (1)
 
where E denotes an expected value and KL(Q||P) is the Kullback-Leibler divergence of Q from P.

You should think of Q(z|X) as an encoder, mapping data X into a vector of latent random variables z, and P(X|z) as a decoder, reconstructing the input given its encoding. P(z) is a prior distribution over latent variables: typically a normal distribution.

The first term in equation (1) is simply a distance of the reconstruction from the original: if P(X|z) has a Gaussian distribution around some decoder function d(z), its logarithm is (up to sign and constants) a quadratic loss between X and d(zX), where zX is the encoding for X. The interesting point is that, at training time, instead of precisely using zX for reconstructing the input image (as we would do in a traditional autoencoder), we sample around this point according to the (learned) distribution Q(z|X). Supposing Q(z|X) has a normal shape, learning the distribution means learning its moments: the mean value zX, and the variance σ2(X).
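The sampling step at training time is the usual reparameterization trick; here is a numpy sketch (in an actual Keras model this would be a sampling layer operating on tensors, and the network would output the log-variance):

import numpy as np

def sample_latent(z_mean, z_log_var):
    # draw a point around z_mean according to the learned Q(z|X)
    eps = np.random.normal(size=z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps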

Intuitively, you can imagine the area of size σ2(X) around zX as the portion of the latent space able to produce a reconstruction close to the original X; for this reason, we expect σ2(X) to be small: we do not want the encodings of different data points to overlap each other.

In this video, we describe the trajectories in a binary latent space followed by ten random digits of the MNIST dataset (one for each class) during the first epoch of training. The animation is summarized in Figure 1, where we use a fading effect to describe the evolution in time.

Figure 1: Trajectories followed in a two-dimensional latent space
by ten MNIST digits during the first training epoch.

For each digit, the area of the circle is the variance computed by the VAE (more precisely, r2 is the geometric average of the two variances along x and y). Initially it is close to 1, but it rapidly decreases to a very small size; this is not surprising, since we need to find room for 60,000 different digits! Also, observe that all digits progressively distribute around the center, in a Gaussian-like distribution. The Gaussian distribution is better appreciated in Figure 2, where we describe the position and “size” of 60 digits (6 for each class) after 10 training epochs.

Figure 2: position and variance of 60 MNIST digits after 10 epochs of training. Observe the Gaussian-like distribution and, especially, the really small values of the variance

 

Let us also remark on the really small values of the variance σ2(X) for all data X (expressed by the area of the circles). Actually, they would further decrease as training proceeds.

The puzzling nature of the Kullback-Leibler term

The questions we shall try to answer are the following:

  1. Why are we interested in learning the variance σ2(X)? Note that this variance is not used during the generation of new samples, since in that case we sample from the prior Normal distribution (we do not have any X!). The variance σ2(X) is only used for sampling during training, but even the relevance of such an operation (apart, possibly, from improving the robustness of the generator) is not evident, especially since, as we have seen, σ2(X) is typically very small! As a matter of fact, the main operational purpose of sampling during training is precisely to learn the actual value of σ2(X), which takes us back to the original question: why are we interested in learning σ2(X)?
  2. The purpose of the Kullback-Leibler component in the objective function is to bring the distribution Q(z|X) close to a normal G(0,1) distribution. That sounds crazy: if, for any X, Q(z|X) were normal, we would have no way to distinguish the different inputs in the latent space. In this case too, we may understand that we try to keep the mean value zX close to 0 with some quadratic penalty, in order to achieve the expected Normal distribution of latent variables (needed for generative sampling), but why are we trying to keep σ2(X) close to 1? If for a pair of different inputs X’ and X” the corresponding Gaussians Q(z|X’) = G(zX’, σ2(X’)) and Q(z|X”) = G(zX”, σ2(X”)) overlap too much, we would have no practical way to distinguish the two points. The mean values zX’ and zX” cannot be too far away from each other, since we expect them to be normally distributed around 0, so we eventually expect the variance σ2(X) to be really small for any X (close to 0), which is what happens in practice. But if this is the expected behavior, why do we have, as part of our learning objective, to keep σ2(X) close to 1?

The kind of answers we are looking for is not at a theoretical level: the mathematics behind Variational Autoencoders is neat, although not very intuitive. Our purpose is to obtain some empirical evidence that could help us better grasp the underlying theory.

A closer look at the Kullback-Leibler component

Before addressing the previous questions, let’s have a closer look at the Kullback-Leibler divergence in equation (1). Supposing that Q(z|X) has a Gaussian shape G(zX, σ2(X)) and the prior P(z) is a normal G(0,1) distribution, we can compute it in closed form:

KL(G(zX, σ2(X))||G(0,1)) = 1/2 (zX2 + σ2(X) − log(σ2(X)) − 1)       (2)

The term zX2 is a quadratic penalty over encodings, meant to keep them around 0. The second part, σ2(X) − log(σ2(X)) − 1, pushes σ2(X) towards the value 1. Note that if we removed this part, sampling during training would lose any interest: if σ2(X) were not counteracted in any way, it would tend to go to 0, and the distribution Q(z|X) would collapse to a Dirac distribution around zX, making sampling pointless.
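For reference, a numpy version of the closed form (2), summed over latent dimensions and averaged over a batch (assuming, as is customary, that the network outputs the log-variance):

import numpy as np

def kl_term(z_mean, z_log_var):
    # 1/2 (zX^2 + sigma^2(X) - log(sigma^2(X)) - 1)
    sigma2 = np.exp(z_log_var)
    return 0.5 * np.mean(np.sum(np.square(z_mean) + sigma2 - z_log_var - 1, axis=-1))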

So, mathematically, sampling at training time allows us to estimate σ2(X), and we are interested in computing σ2(X) because of its usage in equation (2), where we try to counteract the natural tendency of σ2(X) to collapse to 0 by pushing it towards 1. What remains to be understood is the actual purpose of this operation.

Take up as much space as you deserve

The practical purpose of the previous mechanism is to induce each data point X to take up as much room in the latent space as it deserves, compatibly with the similar requirements of the other points. How much space can it take? In principle, all the available space, which is precisely the reason why we try to keep the variance σ2(X) close to 1.

In practice, you force each data point X to compete with all other points for the occupancy of the latent space, each one pushing the others as far away as possible. This should hopefully result in a better coverage of the latent space, which should in turn produce a better generation of new samples.

Let’s try to get some empirical evidence of this fact.

In the case of MNIST, we start getting some significant empirical evidence when considering a sufficiently deep architecture in a latent space of dimension 3 (with 2 dimensions it is difficult to appreciate the difference). In Figure 3, we show the final distribution of 5000 MNIST digits in a 3-dimensional latent space with and without sampling during training (in the case without sampling we keep the quadratic penalty on zX). We also show the result of generative sampling from the latent space, organized in five horizontal slices of 25 points each. For this example we used a dense 784-256-64-16-3 architecture.

Figure 3: Distribution of 5000 MNIST digits in a 3D latent space,
with sampling at training time (left) and without it (right). 

We may observe that sampling during training induces a much more regular disposition of points in the latent space. In turn, this results in a drastic improvement in the quality of randomly generated images.

Does this scale to higher dimensions? Is the Kullback-Leibler component really enough to induce a good coverage of the latent space in view of generative sampling?

Apparently yes, to a good extent. But in higher dimensions there is an even more interesting effect produced by the Kullback-Leibler divergence, which we shall discuss in the next article: the latent representation becomes sparse!