Comparing Latent Spaces (3): no Space for old men

In a previous post we gave evidence that it is possible to pass from one latent space to another by means of a simple linear map. Moreover, this linear map can be defined by locating in the two spaces a small number of elements, forming what we call a Support Set.

In principle, arbitrary (linearly independent) points could be used, but for robustness reasons it is preferable to choose elements as far from each other as possible. A possible technique to build a good Support Set was discussed in the second post on this topic.

If carefully chosen, the elements in the Support Set enucleate the main factors of variation of the data manifold, providing a natural and challenging benchmark to test the expressiveness of generative models.

In this post, we apply our CelebA Support Set to investigate the actual strength of StyleGAN, with some surprising results.

Inverting StyleGAN

First of all, we need to invert StyleGAN, that is, we need to define an encoder mapping the visible domain to the latent one. We work on the so-called W space of StyleGAN, from which the actual decoding starts.

Inversion can be done by means of a fairly traditional CNN. The good news is that, since we can exploit the Generator, there is no risk of overfitting: the inversion net can be trained on a stream of continuously fresh data.
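The training setup can be sketched as follows; the toy generator and all names here are illustrative stand-ins, not the actual StyleGAN architecture or our inversion net:

```python
import numpy as np

W_DIM = 16
PROJ = np.random.default_rng(0).standard_normal((W_DIM, 8 * 8 * 3)) * 0.1

def toy_generator(w):
    # Stand-in for StyleGAN's decoder: any fixed map from W to images.
    return np.tanh(w @ PROJ).reshape(-1, 8, 8, 3)

def training_stream(batch_size=32):
    # Infinite stream of (image, w) pairs: since we can sample the generator
    # at will, the inversion CNN never sees the same batch twice, which is
    # why overfitting is not a concern.
    rng = np.random.default_rng(1)
    while True:
        w = rng.standard_normal((batch_size, W_DIM))
        yield toy_generator(w), w

images, targets = next(training_stream())
```

The inversion CNN is then trained on this stream in the usual supervised way, with the images as input and the latent codes w as targets.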

The results of our StyleGAN-recoder can be seen in the following picture:

Results of our own network for StyleGAN inversion. Images in the first row have been generated by StyleGAN; they are re-coded into the W space and regenerated (second row). The two images are hardly distinguishable.

As you see, results are excellent. Numerically, the mean squared error between generated and re-generated images is around 0.026.

However, when we try to apply inversion to images in our support set (suitably cropped), this is what we get:

StyleGAN inversion on images in the Support Set.
The macro structure (background, pose, illumination, etc.) is preserved, but all other features are lost.
Note also the more “conventional” nature of the images obtained by inversion.

Quite disappointing.

In this case, the mean squared error between images in the Support Set and their reconstructions through StyleGAN is 0.251, almost ten times higher than the previous value.

The only possible conclusion is that images in the Support Set are outside the generative range of StyleGAN: StyleGAN is simply unable to generate those pictures.

Double check by gradient ascent

To further validate the previous claim, let us try to synthesize images in the support set by means of a gradient ascent technique. Again, if we do it on images generated by StyleGAN we obtain almost perfect results. However, when we try to synthesize images in the support set, this is what we get:

The gradient ascent technique confirms that images in the CelebA support set cannot be generated by StyleGAN.
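The optimization itself is conceptually simple; below is a minimal sketch on a linear toy generator (the real setting backpropagates through StyleGAN rather than using a closed-form gradient as here):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((12, 4))      # toy linear "generator": image = A @ w
target = A @ rng.standard_normal(4)   # a target inside the generative range

w = np.zeros(4)
lr = 0.01
for _ in range(1000):
    grad = 2 * A.T @ (A @ w - target)  # gradient of ||A @ w - target||^2
    w -= lr * grad                     # descend on the reconstruction loss

recon_err = np.mean((A @ w - target) ** 2)
```

For targets inside the generative range the residual error goes to essentially zero, while for out-of-range targets (like our Support Set images with StyleGAN) the optimization stalls at a high residual.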

In terms of the sectors discussed in our previous post, they seem to be external to StyleGAN’s generative range, or in any case severely underrepresented:

CelebA Sectors (in green) seem to be external to the latent space of StyleGAN (in blue)

This could simply be a consequence of the fact that StyleGAN is trained on CelebA-HQ, which is based on a relatively small subset of CelebA. However, we believe it has a much deeper cause.

As a matter of fact, the goal of the generator is just to fool the discriminator. To this aim, the generator need only produce images that are likely to belong to the data manifold; there is no need for it to mimic the actual data distribution. On the contrary, unlikely datapoints with oddities, imperfections, or simply elements belonging to underrepresented categories will typically be avoided by GAN generators. This is why GANs have trouble with people with hats, or old men, or people with unusual postures: have you ever seen a GAN generating a person with a hand in her hair, or in front of her face?

However, even in this case, if we restrict the analysis to the subspace covered by StyleGAN, it is possible to map it linearly to the latent space of other generators, as shown in the first post of this thread.

Please, refer to Comparing the latent space of generative models for a more articulated discussion of StyleGAN’s generative issues.


Comparing latent spaces (2): Support Set

In our previous post we claimed that we can pass between the latent spaces of different deep generative models for a given data manifold by means of a simple linear transformation preserving most of the information.

Transformation between latent spaces

If the map is linear, it can be defined in terms of a small set of independent points, with cardinality equal to the dimension of the latent spaces; this is what we call a Support Set. Locating these points in the two latent spaces is enough to define the map.
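As a toy numerical sketch (dimensions and data invented for illustration), locating the same d independent points in both spaces pins down the map uniquely:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                   # common latent dimension (toy value)
M_true = rng.standard_normal((d, d))    # hidden linear map we want to recover

Z1 = rng.standard_normal((d, d))        # support set located in space 1 (one row per point)
Z2 = Z1 @ M_true                        # the same points located in space 2

# d independent points are enough to determine the map exactly:
M = np.linalg.solve(Z1, Z2)
```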

In principle, any set of independent points could serve as a Support Set, but for robustness reasons it seems preferable to choose points as far apart from each other as possible, enucleating the main factors of variation of the data manifold. For this reason, the Support Set is of interest in its own right, and in particular it provides a natural and challenging benchmark to test the expressiveness of generative models.

In this post I describe our technique for defining a good Support Set, based on “sectors” in the latent space.

In the following, we assume we are working with a fixed generative model (any model would do), defined by an encoder Enc and a decoder Dec.

The technique is essentially based on three steps:

  1. features ordering
  2. sectors identification
  3. sample selection

Features ordering

We start by ordering the latent variables according to their relevance for reconstruction in the visible domain.

Let o be an object in the visible domain, and let z_o be its internal representation. The distance between o and Dec(z_o) is the usual reconstruction loss. To evaluate the contribution of a specific variable z_i, we compare the previous value with what we would obtain if z_i were set to 0. This information, averaged over a large number of data samples o, is what we call the information gain relative to z_i.
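A possible sketch of this computation (Dec is an arbitrary decoder; the linear toy decoder below is only for illustration):

```python
import numpy as np

def information_gain(dec, Z, X, i):
    # Increase in mean reconstruction error when latent variable i is
    # zeroed out, averaged over a batch of samples X with encodings Z.
    base = np.mean((dec(Z) - X) ** 2)
    Z_ablated = Z.copy()
    Z_ablated[:, i] = 0.0
    return np.mean((dec(Z_ablated) - X) ** 2) - base

# Toy check: a linear decoder whose variables have very different weights.
rng = np.random.default_rng(0)
weights = np.array([2.0, 0.1, 0.0])
dec = lambda z: z * weights
Z = rng.standard_normal((1000, 3))
X = dec(Z)                              # perfect reconstructions by design
gains = [information_gain(dec, Z, X, i) for i in range(3)]
```

As expected, the gain reflects how much each variable actually contributes to the reconstruction: heavily weighted variables score high, the unused one scores zero.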

For our experiments, we used our recent Split-VAE, which has a latent space of 150 variables. In the following Figure we show the information gain relative to all the latent variables, ordered by relevance. As you see, only a handful of them are really informative.

Information gain for all variables, in
decreasing order

Let’s see the effect of the most informative latent variables in the visible domain: we pick a random latent vector and generate a set of images varying the given variable in the range [-2.25,+2.25].

Effect of the seven most informative latent variables of SVAE in the visible domain

Not surprisingly, most of the variables are associated with a change in luminosity of all or part of the image, possibly associated with modifications in hair and skin colors, source of illumination, tiny variations in the pose and, in some cases, a progressive Female-Male transition.
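Each row of the figure above is produced by a latent traversal of this kind; a minimal sketch (dec stands for any decoder, here replaced by the identity just to check shapes):

```python
import numpy as np

def traversal(dec, z, i, lo=-2.25, hi=2.25, steps=9):
    # Sweep latent variable i of a fixed code z across [lo, hi],
    # keeping every other variable unchanged, and decode each step.
    zs = np.repeat(z[None, :], steps, axis=0)
    zs[:, i] = np.linspace(lo, hi, steps)
    return dec(zs)

row = traversal(lambda z: z, np.zeros(150), i=3)
```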

Sectors identification

The next step consists in defining sectors in the latent space sufficiently apart from each other.

Given a latent variable i, a threshold distance th, and a direction dir ∈ {+,−}, the sector (i,th,dir) is the set of points in the latent space where dir·z_i > th.

We work with the 7 most informative variables and consider all possible intersections, which gives a total of 2⁷ = 128 sectors.
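Assigning a latent code to its sector can be sketched as follows (the threshold value and function name are illustrative):

```python
import numpy as np

def sector_index(z, informative, th=1.0):
    # One bit per informative variable: 1 if z_i > th, 0 if z_i < -th.
    # A point belongs to a sector only if every informative variable
    # clears the threshold in some direction; otherwise return None.
    bits = []
    for i in informative:
        if z[i] > th:
            bits.append(1)
        elif z[i] < -th:
            bits.append(0)
        else:
            return None
    return sum(b << k for k, b in enumerate(bits))
```

With 7 informative variables, the index ranges over the 2⁷ = 128 possible sign combinations; points too close to the origin along some informative axis fall in no sector.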

The idea is visually explained by the following picture, restricted to three dimensions:

Example of sectors in 3 dims (cropped to distance 2
from the origin).
The distance between sectors is equal to twice a
configurable threshold.

As expected, images (generated from points) in the same sector share macroscopic features like background color, pose, hair, and illumination. We show a few examples in the following pictures:

samples in sector 102
samples in sector 109
samples in sector 126

The exploration of sectors is a fascinating topic by itself.

It is also interesting to observe that the number of latent points falling within different sectors at a given threshold is far from uniform, so the density of the latent space seems to be quite irregular. This is particularly problematic in the case of Variational Autoencoders, since it is a symptom of a potential mismatch between the generative prior and the aggregate inference distribution computed by the encoder, a well-known and problematic aspect of VAEs (we posted a couple of contributions on this topic).

Sample selection

The final step for the creation of the Support Set consists in a random selection of a single image from each sector. This gives us a total of 2⁷ = 128 images (since the latent space of SVAE has dimension 150, we integrate the set with a few additional images from the most crowded sectors).

The 128 CelebA images in our Support Set are shown in the following pictures:

As you see, at first glance, apart maybe from a few pathological cases, images in the Support Set have nothing really special. In the second set many people wear a hat: since it is well known that GANs have problems with hats, we also derived a “hat-free” set, which can be downloaded from our github repository:

In spite of their “ordinary” appearance, samples in the support set occupy “extreme” positions in the latent space with respect to the most informative directions, and are supposedly representative of the principal factors of variations in the dataset.

As a partial confirmation of this fact, we expect the distance between elements of the Support Set (in the visible domain) to be appreciably higher than the average distance between points in the full dataset. This is actually the case: the mean squared error between random CelebA images is 0.116, versus 0.183 for samples in the Support Set.

For this reason, the Support Set, in addition to providing a robust set of points to map between different latent spaces, provides a natural benchmark to test the expressiveness of generative models.

As we shall see in the next post, StyleGAN seems to be in big trouble generating images in the CelebA Support Set.

Please refer to Comparing the latent space of generative models for more information on the Support Set and a complete list of its CelebA numbers.

Comparing the latent space of generative models

I have recently been working on the problem of comparing the latent spaces of different generative models, and in particular on the task of looking for transformations between them.

Mapping between the two latent spaces Z_1 and Z_2

Given a generative model, it is usually possible to obtain an encoder-decoder pair mapping the visible space to the latent one (even GANs are easily inverted). Starting from this assumption, it is always possible to map an internal representation in a space Z_1 to the corresponding internal representation in a different space Z_2 by passing through the visible domain. This provides a supervised set of input/output pairs, on which we can try to learn a direct map, as simple as possible.

The astonishing fact is that a simple linear map gives excellent results in many situations. This is quite surprising, given that both the encoder and the decoder are modeled by deep, non-linear transformations.

We tested mapping between latent spaces of:

  • different trainings of the same model (Type 1)
  • different generative models in the same class, e.g. different VAEs (Type 2)
  • generative models with different learning objectives, e.g. GAN vs. VAE (Type 3)

In all cases, a linear map is enough to pass from one space to another while preserving most of the information.
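A sketch of how such a map can be fitted, with a synthetic linear relation standing in for the paired encodings z2 = Enc2(Dec1(z1)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d1, d2 = 200, 8, 6
M_true = rng.standard_normal((d1, d2))

# Supervised pairs obtained by passing through the visible domain;
# here simulated by a hidden linear relation plus a little noise.
Z1 = rng.standard_normal((n, d1))
Z2 = Z1 @ M_true + 0.01 * rng.standard_normal((n, d2))

# The "simplest possible" direct map: an ordinary least-squares fit.
M, *_ = np.linalg.lstsq(Z1, Z2, rcond=None)
fit_err = np.mean((Z1 @ M - Z2) ** 2)
```

In the real setting the pairs come from actual encoders and decoders, and the remarkable finding is that the residual of the linear fit stays small anyway.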

Some examples are provided below. In the first row we have the original, in the second row the image reconstructed by the first generative model, and in the third row the image obtained by the second model after linear relocation in its space.

Relocation of Type 1, between latent spaces relative to different training instances of the same generative model, in this case a Split Variational Autoencoder. The two reconstructions are almost identical.

Relocation of Type 2, between a Vanilla VAE and a state-of-the-art Split-VAE. The SVAE produces better quality images, even if not necessarily in the direction of the original: the information lost by the VAE during encoding cannot be recovered by the SVAE, which instead makes a reasonable guess.

Relocation of Type 3, between a vanilla GAN and an SVAE. Additional examples involving StyleGAN are given in the article. To map the original image (first row) into the latent space of the GAN we use an inversion network. Details of the reconstructions may slightly differ, but colors, pose, and the overall appearance are surprisingly similar. In some cases (e.g. the first picture) the reconstruction re-generated by the VAE (from the GAN encoding!) is closer to the original than that of the GAN itself.

The StyleGAN case

The comparison with StyleGAN has some interesting additional issues, since we are also changing the training dataset (from CelebA to CelebA-HQ), the resolution (from 64×64 to 1024×1024) and the face-crop (slightly larger in the case of StyleGAN).

Although StyleGAN has some problematic and pathological behaviors, which we shall discuss in a forthcoming post, even in this case we are able to define interesting linear maps between latent spaces (we work on the so-called W space of StyleGAN, as is customary for latent space explorations).

The previous picture shows a transformation of Type 3 from the W space of StyleGAN to the latent space of the SVAE. In the first row we have sources, sampled by StyleGAN from w ∈ W. In the second row we have the SVAE reconstructions, starting from a suitably cropped and rescaled image (the SVAE works at resolution 64). These images are the best possible approximation of the source images obtainable by the SVAE. In the third row we show the output produced by the SVAE decoder by directly mapping each w into its latent space: results are very similar to those of the second row.

Even more interesting is the opposite transformation, from SVAE to StyleGAN.

Mapping from the latent space of SVAE to the W space of StyleGAN. In the first row we have images generated by StyleGAN: StyleGAN(w), for w ∈ W. In the second row we have their SVAE-reconstructions, starting from suitably cropped and rescaled versions. Images in the third row are obtained in three steps:

1. encoding StyleGAN(w) in the latent space of the SVAE, obtaining a latent representation z;

2. using a linear map to transform z into a new encoding w’ ∈ W;

3. using StyleGAN to decode it, i.e. computing StyleGAN(w’).

Again, StyleGAN(w) and StyleGAN(w’) are astonishingly similar.

Article: The full article is available here (joint work with Valerio Tonelli).

Code: the code is available at the following repository

The large umbrella of Artificial Intelligence

It is becoming increasingly frequent to see introductions to deep learning where the subject is presented as a subtopic of Machine Learning, which is in turn described as a particular field of Artificial Intelligence.

While this presentation can indubitably have some didactic relevance, in particular for explaining what components of a software algorithm the machine is supposed to learn, from a more historical perspective it is quite confusing and it does not help to understand the mutual relations between the different fields and their evolution.

First of all, it is important to understand that Artificial Intelligence has not always been as popular as it is at present but, like most scientific topics, has traversed periods of mixed fortune, alternating waves of love and disaffection. In particular, there were a couple of exceptionally gloomy periods, known as AI winters, during the years 1974–1980 and 1987–1993, the first following the famous Lighthill report, and the second a consequence of the failure of the “5th-generation computers” project.

As is normal, during favourable periods many research fields tend to rally under the high-profile banner of AI, while in gloomy days they tend to mark and confine their territory more sharply, making distinctions and entrenching themselves in specific, highly specialized topics and clearly identifiable methodologies.

The situation at the beginning of the century

Let’s have a look at the state of the discipline at the beginning of the century.

AI is still slowly recovering from its second winter, also suffering from the emergence of the separate field of Machine Learning (more on it later). Traditional AI is dominated by the historical topics: knowledge representation, expert systems, and (constraint) logic programming. Neural Networks are indeed a part of AI, but play an extremely marginal role. We are still in an epoch of shallow networks, with most of the attention focused on recurrent NNs and the then-recent LSTM models; networks are slow, difficult to train, and not particularly effective. The general perception is that NNs are a dying topic, with no perspectives. In addition, due to their biological inspiration, they are regarded with particular suspicion by the majority of AI researchers, in view of the ongoing and somewhat surreal discussion about “strong” versus “weak” AI.

The lively, emerging field is Machine Learning. Here, the trendy topic is Support Vector Machines (SVMs), followed, at a great distance, by Bayesian models. Machine Learning is not interested in presenting itself as a subfield of AI; on the contrary, it tries to emphasize its distinctive methodologies and its more solid, scientific background. It is instructive to see that in the book “Pattern Recognition and Machine Learning” by Bishop, one of the pillars of the discipline, there is not a single mention of Artificial Intelligence (except for the names of a few conferences in the references). Here too, Neural Networks are a subfield of ML (Bishop’s book devotes a full chapter to these models): they share with ML terminology, methodologies, and relevant techniques. However, again, their role is absolutely marginal, and shallow networks are systematically outperformed by other techniques.

AI, ML and (shallow) Neural Networks at the beginning of the century

The two communities of AI and ML ignored (read: cordially detested) each other; moreover, they jointly barely tolerated people working on Neural Networks, both for the reasons already mentioned and for their guiltless habit of keeping a foot in two worlds.

A myriad of specialized topics

In addition to the above areas of research, further specialization in particular domains of application – vision, natural language processing, optical character recognition, speech comprehension, data mining, decision theory, robotics, … – contributed to fragmenting AI into a myriad of sub-fields that had essentially nothing to say to each other, and no interest in exchanging knowledge.

Fragmentation of AI/ML research around year 2000 (only major fields are reported)

The case of Natural Language Processing (NLP) is paradigmatic. During the sixties, NLP was one of the main areas of application of AI, aggressively supported by the government, mostly interested in the potential military applications of this line of research. However, results were modest, and in 1964 the National Research Council created a commission – the Automatic Language Processing Advisory Committee (ALPAC) – to investigate the problem. The famous report, delivered in 1966, was extremely negative about the prospects of the field.

While the ALPAC report caused the end of the NRC's direct financial support, more money arrived through the Defense Advanced Research Projects Agency (DARPA, previously known as ARPA). Another famous failure was the Speech Understanding Research program at Carnegie Mellon University (CMU), which generated progressive frustration at DARPA (1971–74), finally resulting in the cancellation of an annual grant of three million dollars (1974).

After the first winter of AI, the paths of AI and NLP start to diverge, giving rise to distinct communities with little or no communication between them. In the ’80s, NLP is mostly dominated by symbolic methods, but during the period between 1990 and 2010 there is a progressive adoption of machine learning algorithms, giving rise to so-called Statistical NLP. With the advent of the web at the beginning of the new century, increasing amounts of raw (unannotated) language data finally become available, favouring the development of statistical approaches, and particularly stimulating research on unsupervised and semi-supervised learning algorithms.

The advent of Deep Learning

What changed everything, was the advent of Deep Learning.

Research on deep neural networks had been going on for years; however, the starting date of the Deep Learning “Revolution” is usually fixed at 2012, when a number of unexpected and remarkable results refocused the interest of many different communities on Neural Networks. An emblematic event is the ImageNet competition in October 2012, where the so-called “AlexNet”, a deep convolutional NN by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, won the challenge by a significant margin over shallow machine learning methods. According to many authors, the ImageNet victory in 2012 is a sort of landmark for the new era of “deep learning”.

Since then, Deep Learning techniques have rapidly swallowed up many of the fields that had previously departed from AI, imposing themselves as a unifying and comprehensive framework.

The Deep Learning Revolution in AI

To give an example, since around 2015 NLP has essentially abandoned statistical methods in favour of Deep Neural Networks. This shift entailed substantial changes in the design of NLP systems, now usually characterized by end-to-end learning of high-level tasks, in contrast with the typical pipeline of statistical techniques. For instance, in Neural Machine Translation (NMT) the network is directly trained to learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that were used in statistical machine translation.

Computer vision is another field where Deep Learning algorithms have completely revolutionized the state-of-the-art. Object detection, semantic segmentation, face recognition, image denoising, super-resolution, 3D shaping, are just some of the many research areas where deep neural networks have replaced traditional techniques. In many cases, Deep Learning is proposing innovative challenges, as in the case of “panoptic” segmentation, or the key-point detection task for pose estimation.

Robotics too is currently dominated by Deep Reinforcement Learning (DRL) algorithms. In this case, the use of Deep Neural Networks makes it possible to address the scalability problem of most traditional RL techniques, avoiding explicit modeling of the state space and using Neural Networks as function approximators for the relevant functions.


In conclusion, presenting DL as a subfield of ML, itself a subfield of AI, is somewhat reductive and does not completely reflect the complexity and breadth of the phenomenon. Deep Learning, being based on neural networks, is indeed a historical component of AI (differently from ML). On the other hand, its techniques are surely closer to ML than to other fields of AI. In any case, DL is the real novelty of the renewed AI, and its beating heart. When you read news about new AI achievements in newspapers or other media, in the large majority of cases those results have been obtained by deploying Deep Learning techniques. The diffusion of DL is pervasive, and for the moment the trend shows no signs of slowing down.

Loss of Variance in Variational Autoencoders

Variational Autoencoders (VAEs), in comparison with alternative generative techniques, usually produce images with a characteristic and annoying blurriness.

A possible explanation can be given in terms of a different problem, which really seems to be the opposite facet of the same phenomenon: the variance of generated data is significantly lower than that of the training data.

This problem of VAEs is not widely known, which is precisely the reason why I decided to write a post on this topic.

First of all, what variance are we interested in?

Typically, the output of the network is a tensor with shape B\times D, where B is the batch size and D is the dimension of the output. To fix ideas, let us suppose the output is a coloured image, so that D=W\times H\times 3. We want to measure the variance of each single pixel along the batch axis, that is, how the pixel varies over the data under consideration. Then, we compute the mean of all the variances obtained in this way.

Working with numpy, the formula we are interested in is

np.mean(np.var(x, axis=0))

Now, we compute this expression on a large number of real data (e.g. on the validation set), on an equal number of reconstructed (or generated) images, and take the difference between them:

\Delta_{var} = np.mean(np.var(x, axis=0)) - np.mean(np.var(\hat{x}, axis=0))

You can take your favourite VAE and perform the above simple computation. We would expect the two values to be approximately equal; in particular, there would be no reason to expect the variance of one of the two sets to be systematically higher or lower than the other: the difference could be positive as well as negative.

But this is not the case. As you will discover, the difference is positive and not negligible, which precisely means that we are witnessing a loss of variance in generated data.
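The effect is easy to reproduce on synthetic data by playing the role of an overly prudent decoder that shrinks samples toward their mean (numbers here are illustrative, not the empirical relation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 8, 8, 3))     # stand-in for real images
x_hat = 0.8 * x + 0.2 * x.mean(axis=0)       # "prudent" reconstructions,
                                             # shrunk toward the dataset mean

delta_var = np.mean(np.var(x, axis=0)) - np.mean(np.var(x_hat, axis=0))
mse = np.mean(np.square(x - x_hat))
# Any shrinking toward the mean makes delta_var strictly positive.
```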

The magnitude of the variance loss

Can we estimate the amount of this loss? Experimentally, it turns out to be approximately equal to the mean squared error between real and reconstructed images:

mse = np.mean(np.square(x - \hat{x}))

Note that, unlike the previous quantity, mse is positive by definition.

For a working example, consider the simple variational autoencoder for MNIST defined by F. Chollet in this post: Building Autoencoders in Keras.

In this case (after 50 epochs of training), we have

\begin{array}{rl} mse = & 0.0413 \\ \Delta_{var} = & 0.0411 \end{array}

A lucky coincidence, you might say. Well, quoting master Kenobi, let me say that, in my experience, there is no such thing as luck.

Since I conjectured the above relation, I have systematically measured it on all the VAEs I had the opportunity to work with. In a couple of years I collected a quite large number of data points, summarized in the Figure below.

Relation between mean squared error and variance loss.
The distribution is close to the diagonal.

The different colors refer to different neural architectures: blue = Dense Networks; red = ResNet-like; green = Convolutional Networks; yellow = Iterative Networks (DRAW, GQN-like). The point is not to emphasize the difference between the various architectures (that would require a more extensive evaluation), but to remark that the relation between \Delta_{var} and mse is systematic, and largely independent of the specific architecture.

Of course, if the variance loss is of the same magnitude as the mean squared error, reducing the reconstruction error may also implicitly solve the variance issue. However, the log-likelihood objective, in conjunction with the underlying averaging due to dimensionality reduction and the variational approach, really seems to be the cause of the loss, so merely trying to improve reconstruction may not be the right approach.

Intuitively, VAEs are, by their nature, extremely prudent in their predictions (at the very opposite of GANs). Even when generated samples are really good, as in the case of the recent NVAE, they still have a glossy look and an extremely conventional appearance.

What we would like is to find a way to force a bit more of “verve” in their behaviour, increasing the “temperature” of generated samples.

If you are interested, you may find additional details on this topic in this article:

Andrea Asperti, Variance Loss in Variational Autoencoders. Sixth International Conference on Machine Learning, Optimization, and Data Science. July 19-23, 2020 – Certosa di Pontignano, Siena, Italy

Calibrating VAEs

I recently obtained the best generative results ever achieved for Variational Autoencoders.

Below is a sample of the kind of faces I was able to generate using CelebA as training set, with a moderate use of computing resources:

The result is based on two different insights, that I will briefly discuss in the following sections:

  • a new balancing policy between reconstruction error and Kullback-Leibler divergence in the VAE loss function
  • a renormalization operation for two-stage VAEs, compensating for the loss of variance typical of (Variational) Autoencoders.

Balancing reconstruction error and KL-divergence

The loss function of Variational autoencoders has the following shape:

\underbrace{-\mathbb{E}_{z \sim Q(z|X)}\log(P(X|z))}_{\mbox{reconstruction error}} + \underbrace{KL(Q(z|X)||P(z))}_{\mbox{KL-divergence}}

The first component is meant to enhance the quality of reconstructed images, while the second component acts as a regularizer of the latent space, pushing it toward the prior distribution P(z). The two components have contrasting effects, and their balance is a delicate issue.

Our solution consists in keeping a constant balance throughout training. Supposing P(X|z) has a Gaussian shape, the negative log-likelihood is just the mean squared error between the reconstructed image \hat{X} and the original sample X. During training, we normalize this component by an estimate of the current error, computed over minibatches. In this way, the KL component cannot easily prevail, which would forbid any further improvement in the quality of reconstructions.
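A minimal sketch of the balancing policy (illustrative, not the exact implementation from the paper):

```python
import numpy as np

class BalancedLoss:
    # The reconstruction term is divided by a running estimate of the
    # current mse, so its weight relative to the KL term stays roughly
    # constant throughout training.
    def __init__(self, momentum=0.99):
        self.mse_estimate = None
        self.momentum = momentum

    def __call__(self, x, x_hat, kl):
        mse = np.mean(np.square(x - x_hat))
        if self.mse_estimate is None:
            self.mse_estimate = mse
        else:
            self.mse_estimate = (self.momentum * self.mse_estimate
                                 + (1 - self.momentum) * mse)
        return mse / self.mse_estimate + kl

loss = BalancedLoss()
x = np.ones((4, 8)); x_hat = np.zeros((4, 8))
first = loss(x, x_hat, kl=0.5)   # normalized reconstruction term is 1.0
```

As training improves the reconstructions, the normalizer shrinks with them, so the reconstruction term never becomes negligible with respect to the KL term.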

More details can be found in this article:
Andrea Asperti, Matteo Trentin, Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders, IEEE Access, vol. 8, pp. 199440-199448, 2020, doi: 10.1109/ACCESS.2020.3034828.

Variance Loss and renormalization

Data generated by Variational Autoencoders seem to suffer from a systematic loss of variance with respect to the training set. The phenomenon is likely due to averaging, which in turn could be caused by dimensionality reduction (as in the case of PCA), or by sampling in the latent space, as is typical of the variational approach.

This is relevant, since generative models are evaluated with metrics such as the Fréchet Inception Distance (FID), which precisely compares the distributions of (features of) real versus generated images.

The variance loss becomes particularly dangerous in a two-stage setting, where a second VAE is used to sample in the latent space of the first VAE. The reduced variance creates a mismatch between the actual distribution of latent variables and those generated by the second VAE, which hinders the beneficial effects of the second stage.

A simple solution is to renormalize the output of the second VAE towards the expected spherical normal distribution (or, better, towards the moments of this distribution as computed, e.g., by the variance law discussed in one of my previous posts).
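A minimal sketch of the renormalization, targeting the spherical prior (matching the moments given by the variance law would simply replace the target constants):

```python
import numpy as np

def renormalize(z, eps=1e-8):
    # Rescale second-stage samples toward the expected spherical prior:
    # zero mean and unit variance along each latent dimension.
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

rng = np.random.default_rng(0)
z = 0.7 * rng.standard_normal((5000, 64)) + 0.3   # shrunk, offset samples
z_fixed = renormalize(z)
```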

This simple operation typically results in an appreciable improvement in the perceptual quality of generated images, and a remarkable boost in terms of FID.

In the image below, you see the effect of our technique on images generated from a same seed: on the left you have the original image, and on the right the result of the latent space renormalization.

More details can be found in this article:
Andrea Asperti Variance Loss in Variational Autoencoders, arXiv preprint arXiv:2002.09860 (2020).

A stationary condition for Variational Autoencoders


In this article, we continue our investigation of Variational Autoencoders (see our previous posts on the regularization effect of the Kullback-Leibler divergence, and the sparsity phenomenon). In particular, we shall point out an interesting stationary condition induced by the Kullback-Leibler component of the objective function.

Let us first of all observe that trying to compute relevant statistics for the posterior distribution Q(z|X) of latent variables without some kind of regularization constraint does not make much sense. As a matter of fact, given a network with mean \mu_z(X) and variance \sigma_z^2(X) for a given latent variable z, we can easily build another one having precisely the same behavior by scaling the mean and standard deviation by some constant γ (for all data, uniformly), and then downscaling the generated samples in the next layer of the network. This kind of linear transformation is easily performed by any neural network (for the same reason, it does not make much sense to add a batch-normalization layer just before a linear layer).

Let’s see how the KL-divergence helps to choose a solution. In the following, we restrict attention to a single latent variable z, omitting it from the notation. Starting from the assumption that the network can easily keep a fixed ratio \rho^2(X) = \sigma^2(X)/\mu^2(X), we can push this value into the closed form of the Kullback-Leibler divergence, that is

{\mathit{KL}(G(\mu(X),\sigma^2(X))||G(0,1))= (\mu(X)^2+\sigma^2(X) -\mathit{log}(\sigma^2(X)) -1) /2}

getting the following expression:

(\sigma^2(X)\frac{1 + \rho^2(X)}{\rho^2(X)}-log(\sigma^2(X))-1)/2  \hspace{1cm}(1)

In Figure 1, we plot the previous function in terms of the variance, for a few given values of \rho.

Figure 1: KL-divergence for different values of \rho:
observe the strong minimum for small values of \rho.

The above function has a minimum for

{\sigma^2(X) = \frac{\rho^2(X)}{1+\rho^2(X)}}\hspace{1cm}(2)

close to 0 when \rho is small, and close to 1 when \rho is high. Of course \rho depends on X, while the rescaling operation after sampling must be uniform; still, the network will have a propensity to synthesize variances close to \frac{\rho^2(X)}{1+\rho^2(X)}, with 0 \le \frac{\rho^2(X)}{1+\rho^2(X)} < 1 (below we shall average over all X).
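A quick numerical check of this minimum (a standalone numpy sketch, not part of the original derivation): we evaluate expression (1) on a grid of variances and verify that the minimizer agrees with the closed form \rho^2/(1+\rho^2).

```python
import numpy as np

def kl_fixed_ratio(s2, rho2):
    """Expression (1): the KL term as a function of the variance s2,
    for a fixed ratio rho2 = sigma^2(X) / mu^2(X)."""
    return (s2 * (1 + rho2) / rho2 - np.log(s2) - 1) / 2

# The grid minimizer should agree with the closed form rho2 / (1 + rho2)
for rho2 in (0.1, 1.0, 10.0):
    s2_grid = np.linspace(1e-3, 2.0, 200001)
    s2_min = s2_grid[np.argmin(kl_fixed_ratio(s2_grid, rho2))]
    assert abs(s2_min - rho2 / (1 + rho2)) < 1e-4
```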

Substituting the definition of \rho^2(X) in equation (2), we expect to reach a minimum when \sigma^2(X) = \frac{\sigma^2(X)}{\mu^2(X)}\frac{\mu^2(X)}{\mu^2(X)+\sigma^2(X)} = \frac{\sigma^2(X)}{\mu^2(X)+\sigma^2(X)}, which, by trivial computations, implies the following simple stationary condition:

\sigma^2(X) + \mu^2(X) = 1

Let us now average together the KL components for all data X:

\frac{1}{N}\sum_X \frac{1}{2}(\mu(X)^2 + \sigma^2(X)-log(\sigma^2(X)) -1)

We use the notation \widehat{f(X)} to abbreviate the average \frac{1}{N}\sum_X f(X) of f(X) on all data X.
The ratio \rho^2 = \frac{\widehat{\sigma^2(X)}}{\widehat{\mu^2(X)}} can really (and easily) be kept constant by the network. Let us also observe that, assuming the mean of the latent variable to be 0, \widehat{\mu^2(X)} is just the (global) variance \sigma^2 of the latent variable.

Pushing \rho^2 in the previous equation, we get

{\frac{1}{2}(\widehat{\sigma^2(X)}\frac{1 + \rho^2}{\rho^2}-\widehat{log(\sigma^2(X))} -1)}

Now we perform a somewhat rough approximation. The average of the logarithms \widehat{log(\sigma^2(X))} is the logarithm of the geometric mean of the variances. If we replace the geometric mean with an arithmetic mean, we get an expression essentially equivalent to expression (1), namely

{\frac{1}{2}(\widehat{\sigma^2(X)}\frac{1 + \rho^2}{\rho^2}-log(\widehat{\sigma^2(X)}) -1)}

that has a minimum when

\widehat{\sigma^2(X)} = \frac{\rho^2}{1+\rho^2}

that implies

\widehat{\sigma^2(X)} + \widehat{\mu^2(X)} = 1

or simply,

\widehat{\sigma^2(X)} + \sigma^2 = 1\hspace{1cm}(3)

where we replaced \widehat{\mu^2(X)} with the variance \sigma^2 of the latent variable in view of the consideration above.

Condition (3) can be experimentally verified. In spite of the rough approximation we did to get it, it proves to be quite accurate, provided the Variational Autoencoder is sufficiently trained. You can check it on your own experiments, or compare it with the data provided in our previous posts.
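To make the bookkeeping behind such a check explicit, here is a self-contained numpy sketch; the arrays `mu` and `sigma2` are synthetic statistics constructed to satisfy condition (3), standing in for the encoder outputs of a trained VAE over the full dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Synthetic encoder statistics standing in for a trained VAE: means with
# global variance 0.8 and computed variances averaging 0.2, so that the
# stationary condition (3) holds by construction.
mu = rng.normal(0.0, np.sqrt(0.8), size=n)    # mu(X) over the dataset
sigma2 = rng.uniform(0.1, 0.3, size=n)        # sigma^2(X) over the dataset

# Condition (3): average computed variance + variance of the means ~ 1
lhs = sigma2.mean() + mu.var()
assert abs(lhs - 1.0) < 0.02
```

For a real model, one would replace the two arrays with the means and variances produced by the encoder, and check how close `lhs` is to 1.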

Let us finally remark that condition (3) is supposed to hold both for active and inactive variables.

Sparsity in Variational Autoencoders

In our previous article, we discussed the regularization effect of the Kullback-Leibler divergence in the objective function of Variational Autoencoders, providing empirical evidence that it results in a better coverage of the latent space.

In this article, we shall discuss another important effect of it: working in latent spaces of sufficiently high dimension, the latent representation becomes sparse. Many latent variables are zeroed out (independently of the input), the variance computed by the network for those variables is around 1 (while their actual variance is close to 0), and in any case those variables are neglected by the decoder.

This property is usually known under the name of over-pruning, since it induces the model to use only a small number of its stochastic units. In fact, this is a form of sparsity, with all the benefits typically associated with this kind of regularization.

* * *

Sparsity is a well known and desirable property of encodings: it forces the model to focus on the relevant features of data, usually resulting in more robust representations, less prone to overfitting. 

Sparsity is typically achieved in neural networks by means of weight-decay L1 regularizers, directly acting on weights. Remarkably, the same behaviour is induced in Variational Autoencoders by the Kullback-Leibler divergence, simply acting on the variance of the encoding distribution Q(z|X).

The most interesting consequence is that, at least for a given architecture, there seems to exist an intrinsic internal dimension of data. This property can be exploited both to understand whether the network has sufficient internal capacity, augmenting it to attain sparsity, or conversely to reduce the dimension of the network by removing links to unused neurons. Sparsity also helps to explain the loss of variability in random generative sampling from the latent space that one may sometimes observe with variational autoencoders.

In the following sections we shall investigate sparsity for a couple of typical test cases.


We start with the well known MNIST dataset of handwritten digits that we already used in our previous article. Our first architecture is a dense network with dimensions 784-256-64-32-16.

In Figure 1 we show the evolution during a typical training of the variance of the 16 latent variables.

Figure 1: evolution of the variance along training (16 variables, MNIST
case). On the x-axis we have the number of minibatches, each of size 128.
The variance of some variables goes to zero (as expected), but for other variables it goes to 1, instead. The latter variables will be neglected by the decoder.

Table 1 provides the relevant statistics for each latent variable at the end of training, computed over the full dataset: the variance of the variable (which, for active variables, we expect to be around 1, since they should be normally distributed), and the mean of the computed variance σ2(X) (which, for active variables, we expect to be a small value, close to 0). The mean value of each variable is around 0, as expected, and we do not report it.

no | variance     | mean(σ2(X))
 0 | 8.847272e-05 | 0.9802212

Table 1: inactive variables in the 784-256-64-32-16 VAE for MNIST digits

All variables highlighted in red have an anomalous behavior: their variance is very low (in practice, they always have value 0), while the variance σ2(X) computed by the network is around 1 for each X. In other words, the representation is getting sparse! Only 8 latent variables out of 16 are in use: the others are completely ignored by the generator.
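The detection of inactive variables can be automated from the encoder statistics. Below is a sketch in numpy, with synthetic statistics standing in for a trained encoder (the thresholds are illustrative, not taken from the original experiments):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50_000, 16

# Synthetic encoder statistics mimicking Table 1: variables 0-7 active
# (spread-out means, small computed variance), variables 8-15 inactive
# (means collapsed to 0, computed variance close to 1).
mu = np.zeros((n, d))
sigma2 = np.ones((n, d))
mu[:, :8] = rng.normal(0.0, 1.0, size=(n, 8))
sigma2[:, :8] = rng.uniform(0.01, 0.1, size=(n, 8))

# Flag a variable as inactive when its means barely vary across the
# dataset while the computed variance sigma^2(X) stays near 1.
inactive = (mu.var(axis=0) < 1e-2) & (np.abs(sigma2.mean(axis=0) - 1.0) < 0.1)
```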

As an additional confirmation of this fact, in Figure 2 we show a few digits randomly generated from Gaussian sampling in the latent space (upper line) and the result of generation when inactive latent variables have been zeroed-out (lower line): they are indistinguishable.

Figure 2: Upper line: digits generated from a vector of 16 normally sampled latent variables. Lower line: digits generated after “red” variables have been zeroed-out;
these latent variables are completely neglected by the generator

Let’s try with a convolutional VAE. We consider a relatively sophisticated network, with the following structure (see Figure 3).

Figure 3: architecture of the convolutional VAE

In this case, sparsity is less evident, but still present: 3 variables out of 16 are inactive.

no | variance | mean(σ2(X))

Table 2: inactive variables in the conv-VAE for MNIST digits

The reduced sparsity seems to suggest that convolutional networks make better use of the latent variables, typically resulting in more precise reconstruction and improved generative sampling.
This is likely because latent variables encode information corresponding to different portions of the input space, and are hence less likely to become useless for the generator.

The sparsity phenomenon was first highlighted by Burda et al. in this work, and later confirmed by many authors on several different datasets and neural architectures. We discuss this debated topic and survey the recent literature in this article, where we also investigate a few more concrete examples.

The degree of sparsity may slightly vary from training to training, but not in a significant way (at least, for similar final values of the loss function). This seems to suggest that, given a certain neural architecture, there exists an intrinsic, “optimal” compression of data in the latent space. If the network does not exhibit sparsity, it is probably a good idea to augment the dimension of the latent space; conversely, if the network is sparse we may reduce its dimension by removing inactive latent variables and their connections.

Kullback-Leibler divergence and sparsity

Let us consider the loglikelihood for data X:

Ez∼Q(z|X) log P(X|z) − KL(Q(z|X)||P(z))

If we remove the Kullback-Leibler component from previous objective function, or just keep the quadratic penalty on latent variables, the sparsity phenomenon disappears. So, sparsity must be related to that part of the loss function.

It is also evident that if the generator ignores a latent variable, P(X|z) will not depend on it, and the loglikelihood is maximal when the distribution Q(z|X) equals the prior distribution P(z), that is, a normal distribution with mean 0 and standard deviation 1. In other words, the encoder is induced to learn a trivial encoding zX = 0 and a (fake) variance σ2(X) = 1. Sampling has no effect, since the sampled value of zX will just be ignored.

Intuitively, if during training a latent variable is of moderate interest for reconstructing the input (in comparison with the other variables), the network will learn to give less importance to it; at the end, the Kullback-Leibler divergence may prevail, pushing the mean towards 0 and the standard deviation towards 1. This will make the latent variable even more noisy, in a vicious loop that will eventually induce the network to completely ignore the latent variable.

We can get some empirical evidence of the previous phenomenon by artificially deteriorating the quality of a specific latent variable.
In Figure 4, we show the evolution during training of one of the active variables of the variational autoencoder of Table 1, subject to a progressive addition of Gaussian noise. During the experiment, we force the variables that were already inactive to remain so; otherwise, the network would compensate for the deterioration of the new variable by revitalizing one of the dead ones.

Figure 4: Evolution of the reconstruction gain and KL-divergence of a latent variable
during training, acting on its quality by the addition of Gaussian noise. We also show
in the same picture the evolution of the variance, to compare their progress.

In order to evaluate the contribution of the variable to the loss function, we compute the difference between the reconstruction error when the latent variable is zeroed out and the reconstruction error when it is normally taken into account; we call this quantity the reconstruction gain.
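The reconstruction gain is just a masking experiment. Below is a self-contained sketch where a toy linear decoder (with made-up weights `W`) stands in for the trained generator:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear decoder standing in for the trained generator: variable 0
# carries most of the signal, variable 3 none at all.
W = np.array([[2.0], [0.5], [0.1], [0.0]])

def decode(z):
    return z @ W

z = rng.normal(size=(1000, 4))   # latent codes for a batch
x = decode(z)                    # reference reconstructions

def reconstruction_gain(i):
    """Increase in reconstruction error when latent variable i is zeroed out."""
    z_masked = z.copy()
    z_masked[:, i] = 0.0
    return ((decode(z_masked) - x) ** 2).mean() - ((decode(z) - x) ** 2).mean()

gains = [reconstruction_gain(i) for i in range(4)]
```

Variables with a larger influence on the decoder output get a larger reconstruction gain, while a variable the decoder ignores gets gain zero.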

After each increment of the Gaussian noise we repeat one epoch of training, to allow the network to suitably reconfigure itself. In this particular case, the network reacts to the Gaussian noise by enlarging the mean values \mu_z(X) of the posterior distribution Q(z|X), in an attempt to escape from the noisy region, but thereby also increasing the KL-divergence. At some point, the reconstruction gain of the variable becomes less than the KL-divergence; at this point we stop incrementing the noise. Here, we witness the sparsity phenomenon: the KL-term suddenly pushes the variance towards 1, with the result of decreasing the KL-divergence, but also causing a sudden and catastrophic collapse of the reconstruction gain of the latent variable.

Contrary to what is frequently believed, sparsity seems to be reversible, to some extent. If we remove the noise from the variable, as soon as the network is able to perceive some potential in it (which may take several epochs, as is evident in the figure above), it will eventually make suitable use of it. Of course, we should not expect to recover the original reconstruction gain, since the network may have meanwhile learned a different repartition of roles among latent variables.

About the Kullback-Leibler divergence in Variational Autoencoders


In this article we shall try to provide an intuitive explanation of the Kullback-Leibler component in the objective function of Variational Autoencoders (VAEs). Some preliminary knowledge of VAEs is assumed: see e.g. Doersch’s excellent tutorial for an introduction to the topic.

* * *

Variational Autoencoders (VAEs) are a fascinating variant of autoencoders supporting the random generation of new data samples. The log-likelihood log(P(X)) of a data sample X is approximated by a term known as the evidence lower bound (ELBO), defined as follows
Ez∼Q(z|X) log P(X|z) − KL(Q(z|X)||P(z))          (1)
where E denotes an expected value and KL(Q||P) is the Kullback-Leibler divergence of Q from P.

You should think of Q(z|X) as an encoder mapping data X in a vector of random variables z, and P(X|z) as a decoder, reconstructing the input given its encoding. P(z) is a prior distribution of latent variables: typically a normal distribution.

The first term in equation (1) is simply a distance of the reconstruction from the original: if P(X|z) has a Gaussian distribution around some decoder function d(z), its logarithm is a quadratic loss between X and d(zX), where zX is the encoding of X. The interesting point is that, at training time, instead of using precisely zX to reconstruct the input image (as we would do in a traditional autoencoder), we sample around this point according to the (learned) distribution Q(z|X). Supposing Q(z|X) has a normal shape, learning the distribution means learning its moments: the mean value zX, and the variance σ2(X).

Intuitively, you can imagine the region around zX of size σ2(X) as the portion of the latent space able to produce a reconstruction close to the original X: for this reason, we expect σ2(X) to be a small value, since we do not want the encodings of different data points to overlap each other.
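Operationally, the sampling around zX during training is usually implemented with the reparameterization trick. A minimal numpy sketch (`z_mean` and `sigma2` are made-up encoder outputs for a single input X, in a 2D latent space):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up encoder outputs for a single input X in a 2D latent space
z_mean = np.array([0.7, -1.2])        # the encoding z_X
sigma2 = np.array([0.01, 0.02])       # small variances, as observed in practice

# Reparameterization trick: sample around z_X during training
eps = rng.normal(size=(5, 2))                # auxiliary unit-normal noise
z_samples = z_mean + np.sqrt(sigma2) * eps   # 5 samples from Q(z|X)
```

Because σ2(X) is small, all samples stay close to the encoding zX, so the reconstruction remains close to the original X.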

In this video, we describe the trajectories in a binary latent space followed by ten random digits of the MNIST dataset (one for each class) during the first epoch of training. The animation is summarized in Figure 1, where we use a fading effect to describe the evolution in time.

Figure 1: Trajectories followed in a bidimensional latent space
by ten MNIST digits during the first training epoch.

For each digit, the area of the circle is the variance computed by the VAE (more precisely, r2 is the geometric average of the two variances along x and y). Initially it is close to 1, but it rapidly decreases to a very small dimension; this is not surprising, since we need to find room for 60000 different digits! Also, observe that all digits progressively distribute around the center, in a Gaussian-like distribution. The Gaussian distribution is better appreciated in Figure 2, where we describe the position and “size” of 60 digits (6 for each class) after 10 training epochs.

Figure 2: position and variance of 60 MNIST digits after 10 epochs of training. Observe the Gaussian-like distribution and, especially, the really small values of the variance


Let us also remark the really small values of the variance σ2(X) for all data X (expressed by the area of the circle). Actually, they would decrease further as training proceeds.

The puzzling nature of the Kullback-Leibler term

The questions we shall try to answer are the following:

  1. Why are we interested in learning the variance σ2(X)? Note that this variance is not used during the generation of new samples, since in that case we sample from the prior Normal distribution (we do not have any X!). The variance σ2(X) is only used for sampling during training, but even the relevance of such an operation (apart, possibly, from improving the robustness of the generator) is not evident, especially since, as we have seen, σ2(X) is typically very small! As a matter of fact, the main operational purpose of sampling during training is precisely to learn the actual value of σ2(X), which takes us back to the original question: why are we interested in learning σ2(X)?
  2. The purpose of the Kullback-Leibler component in the objective function is to bring the probability Q(z|X) close to a normal G(0,1) distribution. That sounds crazy: if, for every X, Q(z|X) were normal, we would have no way to distinguish the different inputs in the latent space. In this case too, we may understand that we try to keep the mean value zX close to 0 with some quadratic penalty, in order to achieve the expected Normal distribution of latent variables (needed for generative sampling), but why are we trying to keep σ2(X) close to 1? If, for a pair of different inputs X’ and X”, the corresponding Gaussians Q(z|X’) = G(zX’, σ2(X’)) and Q(z|X”) = G(zX”, σ2(X”)) overlap too much, we would have no practical way to distinguish the two points. The mean values zX’ and zX” cannot be too far away from each other, since we expect them to be normally distributed around 0, so we eventually expect the variance σ2(X) to be really small for any X (close to 0), which is what happens in practice. But if this is the expected behavior, why does our learning objective include keeping σ2(X) close to 1?

The kind of answers we are looking for are not on a theoretical level:
the mathematics behind variational autoencoders is neat, although
not very intuitive. Our purpose is to obtain some empirical evidence that could help us better grasp the underlying theory.

A closer look at the Kullback-Leibler component

Before addressing the previous questions, let’s have a closer look at the Kullback-Leibler divergence in equation (1). Supposing that Q(z|X) has a Gaussian shape G(zX, σ2(X)) and the prior P(z) is a normal G(0,1) distribution, we can compute it in closed form:

KL(G(zX, σ2(X)) || G(0,1)) = 1/2 (zX2 + σ2(X) − log(σ2(X)) − 1)       (2)

The term zX2 is a quadratic penalty over encodings, meant to keep them around 0. The second part, σ2(X) − log(σ2(X)) − 1, pushes σ2(X) towards the value 1. Note that by removing this part, sampling during training would lose any interest: if σ2(X) is not contrasted in any way, it would tend to go to 0, and the distribution Q(z|X) would collapse to a Dirac distribution around zX, making sampling pointless.
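For reference, the closed form (2) transcribed into code (a direct numpy translation of the formula above, per latent variable):

```python
import numpy as np

def kl_vs_prior(z_mean, sigma2):
    """Closed form (2): KL(G(z_mean, sigma2) || G(0, 1)) for one variable."""
    return 0.5 * (z_mean**2 + sigma2 - np.log(sigma2) - 1.0)

# The divergence vanishes exactly when Q(z|X) matches the prior ...
assert kl_vs_prior(0.0, 1.0) == 0.0

# ... while the sigma^2(X) - log(sigma^2(X)) - 1 part grows as the
# variance collapses towards 0, contrasting the Dirac collapse.
assert kl_vs_prior(0.0, 1e-4) > kl_vs_prior(0.0, 0.5) > 0.0
```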

So, mathematically, sampling at training time allows us to estimate σ2(X), and we are interested in computing σ2(X) because of its usage in equation (2), where we try to contrast the natural tendency of σ2(X) to collapse to 0 by pushing it towards 1. What is still to be understood is the actual purpose of this operation.

Take up as much space as you deserve

The practical purpose of the previous mechanism is to induce each data point X to occupy as much of the latent space as it deserves, compatibly with the similar requirement of the other points. How much space can it take? In principle, all the available space, which is precisely the reason why we try to keep the variance σ2(X) close to 1.

In practice, you force each data point X to compete with all other points in the occupancy of the latent space, each one pushing the others as far away as possible. This competition should hopefully result in a better coverage of the latent space, which should in turn produce a better generation of new samples.

Let’s try to get some empirical evidence of this fact.

In the case of MNIST, we start getting some significant empirical evidence when considering a sufficiently deep architecture in a latent space of dimension 3 (with 2 dimensions it is difficult to appreciate the difference). In Figure 3, we show the final distribution of 5000 MNIST digits in a 3-dimensional latent space with and without sampling during training (in the case without sampling we keep the quadratic penalty on zX). We also show the result of generative sampling from the latent space, organized in five horizontal slices of 25 points each. For this example we used a dense 784-256-64-16-3 architecture.

Figure 3: Distribution of 5000 MNIST digits in a 3D latent space,
with sampling at training time (left) and without it (right). 

We may observe that sampling during training induces a much more regular disposition of points in the latent space. In turn, this results in a drastic improvement in the quality of randomly generated images.

Does this scale to higher dimensions? Is the Kullback-Leibler component really enough to induce a good coverage of the latent space in view of generative sampling?

Apparently yes, to a good extent. But in higher dimensions there is an even more interesting effect produced by the Kullback-Leibler divergence, which we shall discuss in the next article: the latent representation becomes sparse!